ABSTRACT



Most research on information utilization in judgment 
and decision making has followed two basic approaches: "regression"
and "Bayesian." Each has characteristic tasks and characteristic
information that must be processed to accomplish these tasks. There 
has been a tendency to work within a single approach with minimal 
communication between the resultant subgroups of workers. This 
analysis of the approaches examines (a) the models developed for 
prescribing and describing the use of information; (b) the major 
experimental paradigms, including the types of judgment, prediction, 
and decision tasks and the kinds of information that have been 
available to the decision maker; (c) the key independent variables 
that have been manipulated; and (d) the major empirical results and
conclusions. Topics discussed include the configural use of 
information, task or environmental determinants of information 
utilization, learning to use information, sequential effects upon 
information processing, strategies for combining information, and
techniques for aiding the decision maker. Of particular interest is
the degree to which the specific models and methods characteristic of 
different paradigms have directed attention to certain problem areas 
to the neglect of other equally important problems. Also of interest 
is whether a researcher studying a particular substantive problem 
could increase his understanding by employing other models and 
experimental methods. By laying bare the similarities and differences 
between the approaches, cross-method research may be facilitated. A
comprehensive bibliography is provided. (Author/GS)
Comparison of Bayesian and Regression Approaches
to the Study of Information Processing in Judgment 1 

Paul Slovic 
and 

Sarah Lichtenstein

Oregon Research Institute 

Our concern in this paper is with human judgment and decision making 
and, in particular, with the processing of information that precedes 
and determines these activities. The distinction between judgments and 
decisions is a tenuous one and will not be maintained here; we shall use 
these terms synonymously.

Regardless of terminology, one thing is certain. Judgment is a 
fundamental cognitive activity that vitally affects the well-being (or,
more accurately, the survival) of every human being. Decisions are
frighteningly more important and more difficult than ever before. Ancient 
man’s most important decisions concerned his personal survival and only 
a limited number of alternatives were available to him. Technological 
innovation has placed modern man in a situation where his decisions now 
control the fate of large population masses, sometimes the whole earth, and 
his sights are now set on outer space. No less important are the multitude 
of personal decisions made by individuals and affecting only themselves and 
a few others. 

The difficulties attendant to decision making are usually blamed on 
the inadequacy of the available information, and, therefore, our technological 
sophistication has been mobilized to remedy this problem. Devices proliferate 
to supply the professional decision maker with an abundance of elegant 
data. Consider, for example, the sophisticated electronic sensors that 
provide information to the physician or the satellites relaying 
masses of strategic data for military intelligence. However, the problem
of interpreting and integrating this information has received surprisingly 
little attention. The decision maker is typically left to his own devices to 
utilize information to its best advantage. More likely than not he will 
proceed, as will the physician, businessman, or military commander, in much the 
same manner that has been relied upon since antiquity — intuition. And 
when you ask him what distinguishes a good judge from a poor one he might 
reply, 

"It's a kind of locked in concentration, an intuition, a feel, nothing 
that can be schooled" (Smith, 1968, p. 20).

But things have begun to change. Specialists from many disciplines 
have started to focus on the integration process itself. Their efforts
center around two broad questions — "What is the decision maker doing with 
the information available to him?" and "What should he be doing with it?" 

The first is a psychological problem — that of understanding how man uses 
information and relating this knowledge to the mainstream of cognitive 
psychology. The second problem is a more practical one and involves the 
attempt to make decision making more effective and efficient. 

The most significant changes have been brought about by the advent 
and widespread availability of the digital computer. Anyone who thinks 
hard about the problems of integrating information into decisions wonders 
about the degree to which computerized systems can alleviate them. It seems
obvious that effective automation of decision making requires knowledge
concerning the operations best performed by man and those best done
mechanically.

The Focus of This Paper 



Information processing occurs at several levels. Our concern here is 
not with microscopic events at the neural level but rather with cognitive
operations performed on such grosser phenomena as symbols, signs, and
facts. We shall focus on the processes and strategies that humans employ
in order to integrate these discrete items of information into a unitary
decision. These are the deliberative processes commonly referred to by
the terms "weighing," "balancing," or "trading off" information, and they
include the activity known as inductive inference.

Prior to 1960 there was relatively little research on human information 
processing at this molar, judgmental level. Some notable exceptions include 
Brunswik's pioneering studies of inference in uncertain environments
(Brunswik, 1956), the work on "probability learning" (Estes, 1959), Edwards'
(1953, 1954a, b, c) investigations into gambling decisions, Miller’s (1956) 
elaboration of the limitations on the number of conceptual items that can 
be processed at one time, the concept formation studies by Bruner, Goodnow, 
and Austin (1956), and the research on computer simulation of thought by 
Newell, Shaw, and Simon (1958). 

Since 1960, the intellectual heritage of this early work has been
supplemented by more than 600 studies within the rather narrowly defined 
topic of information utilization in judgment and decision making. The 
yearly volume of studies has been increasing exponentially, stimulated by a 
growing awareness of the importance of the problems and the aid of the 
ubiquitous computer. The importance of the latter cannot be overestimated. 
When Smedslund (1955) published the first multiple cue probability learning 
study, he bemoaned having to compute 3200 correlations on a desk calculator. 
It’s not surprising that the next study of its kind was not forthcoming for 
about five more years. 

Much of the recent work has been accomplished within two basic schools 
of research. We have chosen to call these the "regression" and the "Bayesian" 
approaches. Each has its characteristic tasks and characteristic information
that must be processed to accomplish these tasks. For the most part,
researchers have tended to work strictly within a single approach and there
has been minimal communication between the resultant subgroups of workers.

Our objective in this chapter is to present a comparative analysis of 
these two broad methods of approach. Within each, we shall examine (a) 
the models that have been developed for prescribing and describing the 
use of information in decision making; (b) the major experimental paradigms, 
including the types of judgment, prediction, and decision tasks and the 
kinds of information that have been available to the decision maker in these 
tasks; (c) the key independent variables that have been manipulated in 
experimental studies; and (d) the major empirical results and conclusions. 

Of particular interest to us is the degree to which the specific 
models and methods characteristic of different paradigms have directed the 
researcher’s attention to certain problem areas and caused him to neglect 
other problems that are equally important. Another question of interest is
whether a researcher studying a particular substantive problem, such as the 
use of inconsistent or conflicting information, could increase his under- 
standing by employing other models and experimental methods. We hope that by 
laying bare the similarities and differences between each approach we can 
facilitate such cross-methods research. 

Areas of Neglect 

Limitations of space and of our own information-processing capabilities 
have forced us to neglect several other paradigms that have made significant 
contributions to the study of human judgment. One of these is the process- 
tracing approach described by Hayes (1968) and exemplified by the work of 
Kleinmuntz (1968) and Clarkson (1962). Researchers following this approach 
attempt to build sequential, branching models of the decision maker based
upon detailed analysis of his verbalizations as he works through actual
decision problems.

Yet another important approach to the study of judgment uses multi-
dimensional scaling procedures to infer the cognitive structure of the judge.
For a detailed coverage of this work the reader is referred to the chapter
by Nancy Wiggins in this book.

There have been several attempts to apply information theory to the 
study of human judgment. One of the most notable efforts along these lines
is the work of Bieri, Atkins, Briar, Leaman, Miller, and Tripodi (1966) which 
examines the transmission of information in social judgment along the lines 
of Miller’s (1956) well-known paradigm. 

Another area we shall neglect here is that of probability learning, 
because it has been reviewed before and because it provides the decision 
maker with minimal opportunity to integrate information. 

Lastly, we have not attempted to review signal detection theory, a 
Bayesian approach that has produced a great deal of research concerning the 
integration of sensory information into decisions. The reader is referred 
to books by Swets (1964) and Green and Swets (1966) for detailed coverage 
of this area. 

The Regression Approach 

The regression approach is so named by us because of its characteristic 
use of multiple regression, and its close relative, analysis of variance (ANOVA) 
to study the use of information by a judge. Within this broad approach we 
shall distinguish two different paradigms which we have labeled the "correla- 
tional" paradigm and the "functional measurement" paradigm. 
The Correlational Paradigm

The correlational paradigm is characterized by its use of correlational
statistics to describe a judge's integration of inherently probabilistic
information. The basic approach requires the judge to make quantitative
evaluations of a number of stimulus objects, each of which is defined by
one or more quantified cue dimensions or characteristics. For example, a
judge might be asked to predict the grade point average for each of a group
of college students on the basis of high school grades and aptitude test
scores. Sarbin and Bailey (1966) elaborate the hopes of the correlational
analyst in a study such as this:

"He correlates the information cues available to the
inferring person with the judgments or inferences. . . .
What usually results is that the coefficients of
correlation between cues and judgment make public the
subtle, and often unreportable, inferential activities
of the inferring person. That is, the coefficients
reveal the relative degrees that the judgments
depend on the various sources of information
available to the judge" (pp. 193-194).

The development of the correlational paradigm has followed two streams. 
One stream has focused on the judge; its goal is to describe the judge's
idiosyncratic method of combining and weighting information by developing 
mathematical equations representative of his combinatorial processes 
(Hoffman, 1960). 

The other stream developed out of the work of Egon Brunswik, whose
philosophy of "probabilistic functionalism" led him to study the organism's
successes and failures in an uncertain world. Brunswik's main emphasis
was not on the organism itself, but on the adaptive interrelationship
between the organism and its environment. Thus, in addition to studying
the degree to which a judge used cues, he analyzed the manner in which
the judge learned the characteristics of his environment. He developed
the "lens model" to represent the probabilistic interrelations between
organismic and environmental components of the judgment situation (Brunswik,
1952, 1956).

Because of his concern about the environmental determinants of
judgment, Brunswik was also the foremost advocate of what he called
"representative design." The essence of this principle is that the organism
should be studied in realistic settings, in experiments that are
representative of its usual ecology. The lens model provides a means for
appropriately specifying the structure of the situational variables in such
an experiment.

The lens model. The lens model has proved to be an extremely valuable
framework for conceptualizing the judgment process. Hammond (1955) described
the relevance of the model for the study of clinical judgment, and recent
work by Hursch, Hammond, and Hursch (1964), Tucker (1964), and Dudycha and
Naylor (1966) has detailed some important relationships among its components
in terms of multiple regression statistics. A diagrammatic outline of a
recent version of the lens model (taken from Dudycha & Naylor) is shown in
Figure 1. The variables $X_1, X_2, \ldots, X_k$ are cues or information sources
that define each stimulus object. For example, if the stimuli being evaluated
are students whose grade point averages are to be predicted, the $X_i$ can
represent high school rank, aptitude scores, etc. The cue dimensions must
be quantifiable, if only to the extent of a 0-1 (e.g., high-low, yes-no)
coding. Each cue dimension has a specific degree of relevance to the true
state of the world. This true state, also called the criterion value, is
designated $Y_e$ (for example, the student's actual grade point average). The
relevance of the $i$th information source is indicated by the correlation,
$r_{i,e}$, across stimuli, between cue $X_i$ and $Y_e$. This value, $r_{i,e}$, is called
the ecological validity of the $i$th cue. The intercorrelations among cues,
again across stimuli, are given by the $r_{i,j}$ values. On the subject's side,
his response or judgment is $Y_s$ (the judged grade point average), and the
correlation of his judgments with the $i$th cue is $r_{i,s}$, also known as his
utilization coefficient for the $i$th cue.

[Figure 1. Diagram of Lens Model showing the relationship among the cues,
criteria, and subjects' responses. Recoverable labels: Stimulus Dimensions;
Matching Index. (Taken from Dudycha and Naylor, 1966).]

Both the criterion and the judgment can be predicted from additive
linear combinations of the cues as indicated by the following regression
equations:

$$\hat{Y}_e = \sum_{i=1}^{k} b_{i,e} X_i \qquad (1)$$

$$\hat{Y}_s = \sum_{i=1}^{k} b_{i,s} X_i \qquad (2)$$

Equation (1) represents the prediction strategy that is optimal in
the sense of minimizing the sum of squared deviations between $\hat{Y}_e$ and $Y_e$.
The multiple correlation coefficient, $r_e = r_{Y_e \hat{Y}_e}$, indicates the degree to
which the weighted combination of cues serves to predict the state of $Y_e$.
Equation (2) represents the subject's decision-making strategy or
policy. The multiple correlation coefficient, $r_s = r_{Y_s \hat{Y}_s}$, is a measure of
how well his judgments can be predicted by a linear combination of cue
values. It is also known as the subject's response consistency. The values
of $b_{i,e}$ and $b_{i,s}$ provide measures of the importance of each cue in the
ecology and for the judge.

The two most important summary measures of the judge's performance are:

$r_a = r_{Y_e Y_s}$, the achievement index, and

$r_m = r_{\hat{Y}_e \hat{Y}_s}$, the matching index.

All of the above equations apply to linearly predictable relations and
dependencies. The model has been further expanded by Hursch et al. to
express non-linear cue utilization by the introduction of the C coefficient.
C is the correlation between the residual which cannot be linearly predicted
in the criterion and the residual which cannot be linearly predicted in
the judgment. If either of these residuals is random, C will be zero.

Tucker (1964) has shown that the indices of the lens model are related
in a general equation for achievement:

$$r_a = r_e r_s r_m + C \left[ (1 - r_e^2)(1 - r_s^2) \right]^{1/2} \qquad (3)$$

Equation (3) plays an extremely important role in many empirical studies
and has come to be called the lens model equation. It demonstrates that
achievement is a function of the statistical properties of the environment
($r_e$), as well as the statistical properties of the subject's response
system ($r_s$), the extent to which the linear weightings of the two systems
match one another ($r_m$), and the extent to which nonlinear variance of one
system is correlated with nonlinear variance of the other (C). As Hammond
(1966) notes, the lens model permits a precise analysis of the relative
contributions of environmental factors to a judge's achievement and thus
serves as a valuable adjunct to research in the Brunswikian tradition.
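
The indices entering the lens model equation can be computed directly from cue values, criterion values, and judgments. The following Python sketch does so for a small, entirely hypothetical data set (two cues, ten students; the numbers are invented for illustration and come from none of the studies cited) and checks numerically that the achievement index agrees with $r_e r_s r_m + C[(1 - r_e^2)(1 - r_s^2)]^{1/2}$. The helper names `corr`, `solve`, and `linear_fit` are illustrative, and an intercept is assumed in both regressions.

```python
import math

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def linear_fit(cues, y):
    """Least-squares fit of y on the cues plus an intercept; returns fitted values."""
    rows = [[1.0] + list(row) for row in cues]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    weights = solve(xtx, xty)
    return [sum(w * x for w, x in zip(weights, r)) for r in rows]

# Hypothetical data: (cue 1, cue 2) per student, actual GPA, judged GPA.
cues = [(3, 5), (5, 2), (6, 8), (2, 4), (8, 7), (7, 3), (4, 9), (9, 6), (1, 1), (5, 5)]
y_e = [2.4, 2.2, 3.5, 1.9, 3.6, 2.9, 3.0, 3.8, 1.2, 2.7]   # criterion values
y_s = [2.6, 2.5, 3.2, 2.0, 3.9, 3.1, 2.8, 3.5, 1.5, 2.9]   # judgments

yhat_e = linear_fit(cues, y_e)                  # linear model of the environment
yhat_s = linear_fit(cues, y_s)                  # linear model of the judge
resid_e = [y - f for y, f in zip(y_e, yhat_e)]
resid_s = [y - f for y, f in zip(y_s, yhat_s)]

r_e = corr(y_e, yhat_e)        # linear predictability of the criterion
r_s = corr(y_s, yhat_s)        # response consistency
r_a = corr(y_e, y_s)           # achievement index
r_m = corr(yhat_e, yhat_s)     # matching index
C = corr(resid_e, resid_s)     # correlation between the two sets of residuals

rhs = r_e * r_s * r_m + C * math.sqrt((1 - r_e ** 2) * (1 - r_s ** 2))
print("r_a =", round(r_a, 6), " right-hand side =", round(rhs, 6))
```

Because both regressions use the same cues, each residual is uncorrelated with any linear combination of the cues, which is why the two sides agree up to rounding error.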

Mathematical models of the judge. As we have seen, the lens model was
developed to study the effects of the decision maker's environment on his
performance. Because of this environmental emphasis, the focal component
of the model is $r_a$, the achievement index. Workers following the other
stream of correlational research have had a different emphasis. They have
been more interested in the judge's weighting process (his policy). In
contrast with the Brunswikian tradition, they have placed less importance
upon modeling the environment and, instead, have stressed the need to control
the environmental situation. They tend to make the stimulus dimensions
explicit and to vary their levels systematically, even though some degree of
realism may be lost in the process. Hoffman (1960) rationalizes the need
for experimental control as follows:

". . . restricting the situation [by controlling the
stimuli] assures that each person is evaluated with 
respect to the same information. Ambiguous and equivocal 
cues are removed, and all judges are thereby certain to 
have at their disposal the same information and no 
more. The inferences made beyond this point are certain 
to have their origins in the data provided" (p. 118). 

A wide variety of mathematical models have been developed to capture
the judge's policy. The first and most prominent of these is the linear
model (Hoffman, 1960) as exemplified by Equation 2 of the lens model.
Alternatively, when the judge is classifying stimuli into one of two categories,
the linear discriminant function, rather than the multiple regression equation,
can be used to analyze the way that cues are weighted (Rodwan & Hake, 1964).
In either form, the model captures the notion that the judge's predictions
are a simple linear combination of each of the available cues. When judgment
is represented by the linear model, the $b_{i,s}$ values of Equation 2 and the
utilization coefficients, $r_{i,s}$, are used to represent the relative
importance given each cue. Hoffman (1960) proposed another index, "relative
weight," designed for this purpose. Relative weights are computed as
follows:



$$RW_{i,s} = \frac{\beta_{i,s}\, r_{i,s}}{r_s^2}$$



Since the sum of relative weights is 1.0, Hoffman’s index can be used to 
describe the relative contribution of each of the predictors as a proportion 
of the predictable linear variance. 
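
A minimal sketch of the relative-weight computation follows. It assumes that each relative weight is the product of a cue's standardized regression weight and its utilization coefficient, divided by the squared multiple correlation of the judge's linear model; that reading is consistent with the weights summing to 1.0. The two-cue closed form for the standardized weights and the data (ten hypothetical judgments) are illustrative choices, not taken from Hoffman (1960).

```python
import math

def zscores(v):
    """Standardize a sequence using the population standard deviation."""
    m = sum(v) / len(v)
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / sd for x in v]

def corr(a, b):
    """Pearson correlation as the mean product of z-scores."""
    za, zb = zscores(a), zscores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Hypothetical cue values and judgments for ten stimuli.
x1 = [3, 5, 6, 2, 8, 7, 4, 9, 1, 5]
x2 = [5, 2, 8, 4, 7, 3, 9, 6, 1, 5]
judgments = [2.6, 2.5, 3.2, 2.0, 3.9, 3.1, 2.8, 3.5, 1.5, 2.9]

# Utilization coefficients and the cue intercorrelation.
r1, r2, r12 = corr(x1, judgments), corr(x2, judgments), corr(x1, x2)

# Standardized regression weights for the two-cue case.
beta1 = (r1 - r12 * r2) / (1 - r12 ** 2)
beta2 = (r2 - r12 * r1) / (1 - r12 ** 2)

# Squared multiple correlation, computed independently from the fitted values.
z1, z2 = zscores(x1), zscores(x2)
predicted = [beta1 * a + beta2 * b for a, b in zip(z1, z2)]
r_s_squared = corr(predicted, judgments) ** 2

rw1 = beta1 * r1 / r_s_squared
rw2 = beta2 * r2 / r_s_squared
print("relative weights:", round(rw1, 4), round(rw2, 4), "sum:", round(rw1 + rw2, 6))
```

With correlated cues, as here, an individual relative weight can be hard to interpret even though the weights still sum to 1.0.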






Darlington (1968) has recently pointed out the unfortunate fact that 
indices of relative weight become suspect when the factors are inter- 
correlated. This problem has led many researchers to work with sets of 
stimuli across which the cues are made orthogonal to one another. One device 
used to insure orthogonality has been to construct stimuli by producing 
factorial combinations of the cues. Of course, this practice is anathema to 
Brunswikians, being the antithesis of representative design. Brunswik
observed (1955; pp. 204-205) that factorial designs may produce certain
combinations of values that are incompatible in nature or otherwise
unrealistic and disruptive of the very process they were meant to disclose. This
criticism cannot be taken lightly and, as we shall see, some evidence does
exist that judgment processes differ as a function of cue interrelationships.
But, for the researcher who is primarily interested in relative weights,
rather than in $r_a$, orthogonal designs often seem preferable to those in
which the cues are representatively correlated. Attempts are usually made,
however, to mitigate potential disruptive effects by telling the judge that
he will be dealing with a selected, rather than a random, sample of cases
and by eliminating combinations of factors that are obviously unreal (see,
for example, Hoffman, Slovic, & Rorer, 1968).

As we shall see, the linear model does a remarkably good job of 
predicting human judgments. However, judges' verbal introspections indicate
their belief that they use cues in a variety of nonlinear ways, and researchers
have attempted to capture these with more complex equations. One type of 
nonlinearity occurs when an individual cue relates to the judgments in a 
curvilinear manner. For example, this quote from a leading authority on 
the stock market suggests a curvilinear relation between the volume of 
trading on a stock and its future prospects: 
"If you are driving a car you can get to your
destination more quickly at 50 mph than at 10 
mph. But you may wreck the car at 100 mph. In 
a similar way, increasing volume on an advance 
up to a point is bullish and decreasing volume 
on a rally is bearish, but in both cases only 
up to a point" (Loeb, 1965, p. 287). 

Curvilinear functions such as this quote suggests can be modeled by including
exponential terms (i.e., $X_i^2$, $X_i^3$, etc.) as predictors in the judge's policy
equation.
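
As an illustration of a squared term in a policy equation, the sketch below fits a linear and a quadratic component to hypothetical "bullishness" ratings made at five equally spaced volume levels. The ratings are invented to follow the inverted U that the quotation describes. Centering the squared term makes the two predictors orthogonal for equally spaced levels, so each weight can be computed separately; this shortcut is a convenience of the example, not a method attributed to the studies cited.

```python
# Hypothetical judged "bullishness" at five equally spaced trading volumes.
levels = [-2, -1, 0, 1, 2]                 # volume, centered on its midpoint
judgments = [1.0, 4.0, 6.0, 5.0, 2.0]      # inverted U, invented for illustration

n = len(levels)
mean_j = sum(judgments) / n
mean_sq = sum(x * x for x in levels) / n   # mean of X^2 (equals 2 here)

linear = [float(x) for x in levels]             # X term
quadratic = [x * x - mean_sq for x in levels]   # X^2 term, centered

def weight(term):
    """Least-squares weight for one term (valid because the terms are orthogonal)."""
    return sum(c * y for c, y in zip(term, judgments)) / sum(c * c for c in term)

b_lin, b_quad = weight(linear), weight(quadratic)

# Share of the variation in judgments carried by each component.
ss_total = sum((y - mean_j) ** 2 for y in judgments)
ss_lin = b_lin ** 2 * sum(c * c for c in linear)
ss_quad = b_quad ** 2 * sum(c * c for c in quadratic)

print("linear weight:", round(b_lin, 3), " quadratic weight:", round(b_quad, 3))
print("variance shares:", round(ss_lin / ss_total, 3), round(ss_quad / ss_total, 3))
```

A negative quadratic weight with a small linear weight corresponds to "bullish only up to a point": judged prospects rise with volume and then fall.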

A second type of nonlinearity occurs when cues are combined in a 
configural manner. Configurality means that the judge’s interpretation or 
weighting of an item of information varies according to the nature of other 
available information. An excellent example of configural reasoning involving 
price changes, volume of trading, and market cycle is given by the same 
stock market expert: 

"Outstanding strength or weakness can have precisely 
opposite meanings at different times in the market
cycle. For example, consistent strength and volume
in a particular issue, occurring after a long general
decline, will usually turn out to be an extremely
bullish indication. . . . On the other hand, after
an extensive advance which finally spreads to issues
neglected all through the bull market, belated
individual strength and activity not only are likely
to be shortlived but may actually suggest the end of
the general recovery. . . ." (Loeb, 1965, p. 65).

When professional decision makers state that their judgments are
associated with complex, sequential, and interrelated rules, chances are 
they are referring to some sort of configural process. It is important, 
therefore, that techniques used to describe judgment be sensitive to configur- 
ality. The C coefficient, described earlier, is rather unsatisfactory from 
a descriptive standpoint because of its lack of specificity. For example, 
a random, a linear, or an invalid configural strategy will each result in a 
C value of zero. Even a non-zero C coefficient does not indicate the form of 
the nonlinear cue utilization.

One way of making the linear model sensitive to configural effects has
been to incorporate cross-product terms into the policy equation of the judge.
Thus, if the meaning of factor $X_1$ varies as a function of the level of
factor $X_2$, the cross-product term $X_1 X_2$ would have to be added to the
equation. When models become this complex, however, the proliferation of
highly intercorrelated terms in the equations becomes so great that estimation
of the weighting coefficients becomes unreliable unless vast numbers of cases are
available (Hoffman, 1968). For this reason investigators such as Hoffman,
Slovic, and Rorer (1968), Rorer, Hoffman, Dickman, and Slovic (1967), and
Slovic (1969) have turned to the use of analysis of variance (ANOVA) to
describe complex judgment processes.
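
A cross-product term can be illustrated with the smallest possible case: two dichotomous cues in a balanced 2 x 2 design. The sketch below uses hypothetical ratings patterned on the stock market quotation (strength in an issue is judged bullish after a long decline but not after a long advance). The cues are coded +1 and -1, which makes the predictors orthogonal, so each weight is a simple average; the coding, the ratings, and the function names are assumptions made for the example.

```python
# Each case: (x1, x2, judgment).
# x1: individual strength of the issue (+1 strong, -1 weak).
# x2: market phase (+1 after a long advance, -1 after a long decline).
cases = [(+1, -1, 8.0), (+1, +1, 3.0), (-1, -1, 4.0), (-1, +1, 5.0)]
n = len(cases)

# With +1/-1 coding in a balanced design, each weight is an average of signed judgments.
b0 = sum(y for _, _, y in cases) / n
b1 = sum(x1 * y for x1, _, y in cases) / n
b2 = sum(x2 * y for _, x2, y in cases) / n
b12 = sum(x1 * x2 * y for x1, x2, y in cases) / n   # weight of the cross-product term

def policy(x1, x2):
    """Policy equation with a cross-product (configural) term."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def effect_of_strength(x2):
    """Change in judgment when x1 goes from weak to strong, at a given market phase."""
    return policy(+1, x2) - policy(-1, x2)

print("weights:", b0, b1, b2, b12)
print("effect of strength after decline:", effect_of_strength(-1))
print("effect of strength after advance:", effect_of_strength(+1))
```

The sign of the effect of strength reverses with the market phase, which is what a nonzero cross-product weight expresses.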

The structural model underlying ANOVA is quite similar to that of 
multiple regression, both being alternative formulations of a general linear 
model (Cohen, 1968). However, the ANOVA model typically imposes two
important restrictions on the factors that describe the cases being judged: (a)
the levels of the factors must be categorical (e.g., good vs. average vs. poor;
up vs. down; etc.) rather than continuous variables; and (b) the factors
must be orthogonal. The usual way to produce orthogonality is to construct
all possible combinations of the cue levels in a completely crossed factorial 
design. In return for these restrictions, the ANOVA model efficiently sorts 
the information about linear and nonlinear judgment processes into non- 
overlapping and meaningful portions. 

When judgments are analyzed in terms of an ANOVA model, a significant
main effect for cue $X_1$ implies that the judge's responses varied systematically
with $X_1$ as the levels of the other cues were held constant. Provided
sufficient levels of the factor were included in the design, the main effect
may be divided into effects due to linear, quadratic, cubic, etc., trends.
Similarly, a significant interaction between cues $X_1$ and $X_2$ implies that
the judge was responding to particular patterns of those cues; that is, the
effect of variation of cue $X_1$ upon judgment differed as a function of the
corresponding level taken by cue $X_2$.

The ANOVA model thus has potential for describing the linear, curvi- 
linear, and configural aspects of the judgment process. Within the framework 
of the model, it is possible to calculate an index of the importance of 
individual or patterned use of a cue, relative to the importance of other
cues. The index $\omega^2$, described by Hays (1963, pp. 324, 382, 407), provides
an estimate of the proportion of the total variation in a person’s judgments 
that can be predicted from a knowledge of the particular levels of a given 
cue or a pattern of cues. It includes linear and nonlinear variance and, 
therefore, it is analogous to, but more general than, Hoffman’s index of 
relative weight. 
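
The partition that ANOVA performs can be sketched for a small factorial set of judgments. The example below uses a hypothetical 3 x 2 table with one judgment per cell and splits the total variation into two main effects and their interaction. It reports raw proportions of the total sum of squares, which is only a descriptive stand-in: the index described by Hays also adjusts for error variance, and estimating that requires replicated judgments, which this example does not include.

```python
# Hypothetical judgments: rows are three levels of cue A, columns two levels of cue B.
table = [[8.0, 6.0],
         [5.0, 6.0],
         [1.0, 4.0]]
n_rows, n_cols = len(table), len(table[0])
cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]

grand = sum(table[i][j] for i, j in cells) / len(cells)
row_means = [sum(row) / n_cols for row in table]
col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]

# Sums of squares for the total, the two main effects, and the interaction.
ss_total = sum((table[i][j] - grand) ** 2 for i, j in cells)
ss_a = n_cols * sum((m - grand) ** 2 for m in row_means)
ss_b = n_rows * sum((m - grand) ** 2 for m in col_means)
ss_ab = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2 for i, j in cells)

shares = {"A": ss_a / ss_total, "B": ss_b / ss_total, "A x B": ss_ab / ss_total}
for name, share in shares.items():
    print(name, round(share, 3))
```

Because the design is completely crossed, the three components do not overlap and their shares sum to one.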

One difficulty with the ANOVA technique is that a complete crossing of 
all possible combinations of cues becomes unmanageable when the number of 
cues increases above a relatively small number, or when it is desirable to 
include many levels of each cue. However, if one is willing to assume that 
some of the higher-order interactions are zero, then it is possible to employ 
a fractional replication design and evaluate the importance of the main effects 
and lower-order interactions with a considerably reduced number of stimuli 
(Anderson, 1964; Cochran & Cox, 1957; Shanteau, 1969; and Slovic, 1969).
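
The saving from a fractional replication can be seen in a small sketch. It builds a half replicate of a design with four two-level cues by setting the fourth cue equal to the product of the first three, a standard textbook construction chosen here for illustration; the studies cited may have used other fractions. The check confirms that the four main-effect columns stay mutually orthogonal in 8 stimuli rather than 16, at the price of two-factor interactions being confounded in pairs.

```python
from itertools import combinations, product

# Full factorial: every combination of four two-level cues (coded -1 / +1).
full = list(product((-1, 1), repeat=4))

# Half replicate: cross cues A, B, C completely and set D = A * B * C.
half = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

def orthogonal(design, i, j):
    """True when columns i and j of the design are uncorrelated."""
    return sum(run[i] * run[j] for run in design) == 0

main_effects_clear = all(orthogonal(half, i, j) for i, j in combinations(range(4), 2))

# The A x B pattern is identical to the C x D pattern in every stimulus,
# so these two interactions cannot be separated in the half replicate.
ab_confounded_with_cd = all(run[0] * run[1] == run[2] * run[3] for run in half)

print(len(full), "stimuli in the full design;", len(half), "in the half replicate")
print("main effects mutually orthogonal:", main_effects_clear)
print("A x B confounded with C x D:", ab_confounded_with_cd)
```

This is why the approach depends on the assumption that the confounded higher-order effects are zero.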

The Functional Measurement Paradigm 

The technique of functional measurement can be considered an extension 
and generalization of the correlational paradigm. As such, it has formed the basis 
of an intensive program of research on information processing over the past 
decade. The essential ideas and representative results stem from the work 
of Norman Anderson, and are summarized in Anderson (1968b, 1969, 1970). A
parallel approach, conjoint measurement, is outlined in papers by Luce and 
Tukey (1964) and Tversky (1967). 

Functional measurement attempts to perform several jobs simultaneously:
the scaling of stimulus attributes and response measures and the determination 
of the psychological law or function relating the two. Its basic premise 
is that measurement scales and substantive theory are integrally related. In 
functional measurement studies of information processing, the subject 
receives several items of information that are to be integrated into a single 
judgment. The theoretical problem is to relate this judgment to the psycho- 
logical scale values and the weights of each item of information. Special 
attention has been given to judgmental tasks in which a simple algebraic 
model, involving adding, averaging, subtracting, or multiplying the infor- 
mational input, serves as the substantive theory. 

The main technical features of functional measurement are use of 
factorial designs, quantitative responses, and a procedure for monotonically 
rescaling these responses. The use of factorial designs arises from the fact 
that the theoretical models studied thus far have almost always been reducible 
to an ANOVA model. Therefore, ANOVA has been the principal analytical tool,
serving to represent the theoretical postulates and providing a goodness of fit 
test of the models. In addition, ANOVA provides estimates of such theoreti- 
cal parameters as the psychological values of the information items. 

Some examples may serve to illustrate the method. Consider the 
simplest application of functional measurement — to additive models. It 
is convenient to describe the model as applied to a two-way factorial design. 
The rows of the design would correspond to stimulus aspects (items of 
information) from Set $S = [S_1, S_2, \ldots, S_m]$ and the columns to aspects from
Set $T = [T_1, T_2, \ldots, T_n]$. Items within each set could represent the same
sorts of information, as in studies of impression formation where Sets S and 
T each contain different adjectives descriptive of a person (Anderson, 1962). 
Or they could represent different stimulus dimensions, as in the correlational 
paradigm. An example of this is the study by Sidowski and Anderson (1967)
where subjects judged the attractiveness of working at a certain occupation 
(Doctor, Lawyer, Accountant, or Teacher) in a certain city (City A, B, C, or 
D). Each cell of the design corresponds to a pair of items (two adjectives 
or a city-occupation combination) that the judge is to integrate. 

The subjective values of $S_i$ and $T_j$ are denoted by the corresponding
lower-case letters $s_i$ and $t_j$, respectively. The equation for the basic
model is then:

$$R_{ij} = w_1 s_i + w_2 t_j \qquad (4)$$

where $R_{ij}$ is the theoretical response to the stimulus described by the pair
of items $(S_i, T_j)$ and $w_1$ and $w_2$ are the weights of the row and column
dimensions. It is usually assumed that the subjective value of an item is
independent of the other item with which it is paired and that $w_1$ and $w_2$
are constant over row and column stimuli, respectively.

Equation (4) implies that the row by column interaction is zero in
principle and nonsignificant in practice. Therefore, ANOVA serves to
test the model's goodness of fit. If the model passes this test, it may
be used to estimate the subjective values $s_i$ and $t_j$. For example, if
Equation (4) is averaged over columns, the mean for Row $i$ is

$$\bar{R}_{i\cdot} = w_1 s_i + \text{constant}, \tag{5}$$

where the dot subscript on $R$ denotes the average over the column index. The
constant expression is $w_2 \bar{t}_{\cdot}$, the same for all rows. Equation (5) says that
the row means form a linear function of the subjective values of the row stimuli.
In terms of the raw data, then, the row means constitute an interval scale
of the row stimuli. Similarly, the column means constitute an interval
scale of the column stimuli.
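
The scaling argument of Equations (4) and (5) can be checked numerically. The following minimal Python sketch builds a two-way factorial table from the additive model and confirms that the row means are a linear function of the row scale values and that the row-by-column interaction vanishes. All scale values and weights are hypothetical, chosen only for illustration; they do not come from any study cited here.

```python
# Hypothetical subjective values and weights (illustrative assumptions only).
s = [1.0, 3.0, 4.0, 7.0]   # row scale values s_i
t = [2.0, 5.0, 6.0]        # column scale values t_j
w1, w2 = 0.6, 0.4          # weights of the row and column dimensions


def mean(values):
    return sum(values) / len(values)


# Theoretical responses R_ij = w1*s_i + w2*t_j for every cell of the design.
R = [[w1 * s_i + w2 * t_j for t_j in t] for s_i in s]

row_means = [mean(row) for row in R]
col_means = [mean([row[j] for row in R]) for j in range(len(t))]
grand_mean = mean(row_means)

# Equation (5): each row mean equals w1*s_i plus the same constant w2*mean(t).
constant = w2 * mean(t)
predicted_row_means = [w1 * s_i + constant for s_i in s]

# Interaction residuals: what is left after removing row and column effects.
residuals = [
    [R[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(len(t))]
    for i in range(len(s))
]
max_interaction = max(abs(x) for row in residuals for x in row)

print(row_means)
print(max_interaction)
```

With real judgments the residuals would not be exactly zero, and the ANOVA interaction test would be needed to decide whether they exceed what response variability alone would produce.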

All of the above results hold for an additive model. An averaging model 
constrains the weights to sum to unity and this constraint provides a basis
for estimating both scale values and weights (Anderson, 1970). 

Anderson (1969) notes that caution is required in interpreting the 
meaning of significant interactions when these occur. Interactions may occur 
as a result of cognitive configurality that violates the model or from 
defects in the measurement scale of the response such as floor and ceiling 
effects, response preferences, and anchor effects. In some cases, a monotonic 
rescaling of the response can be used to eliminate the interaction and save
the model. 

Anderson and colleagues have applied these techniques in a number of
ingenious ways to study various substantive problems of information processing.
For example, they have studied tasks in which stimuli were presented in
serial order and the serial positions corresponded to factors in the design.
The weights indicated by the main effects thus produce a serial position
curve that can be used to assess primacy or recency effects in information
combination (Anderson, 1965b; 1968a; Shanteau, 1969). When information
is presented serially, Anderson (1965b) noted that the weighted average
model can be reformulated in a manner that makes it particularly valuable
for studying the step-by-step buildup of a judgment in response to each
item. This form, called the "proportional change model," asserts that the
judgment, $R_k$, produced after the $k$th item of information is received, is
given by

$$R_k = R_{k-1} + C_k (s_k - R_{k-1}), \tag{6}$$

where $R_{k-1}$ is the judgment prior, and $R_k$ is the judgment posterior, to
presentation of the $k$th item. The scale value of the $k$th item is denoted by
$s_k$, and $C_k$ is a change parameter that measures the influence of the $k$th
item.
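
The step-by-step character of the proportional change model is easy to see in a short computation. The Python sketch below applies the update rule to a series of items; the function name, the starting judgment, the scale values, and the change parameters are all hypothetical choices made for illustration.

```python
def proportional_change(initial_judgment, scale_values, change_params):
    """Return the judgment after each item under R_k = R_{k-1} + C_k (s_k - R_{k-1})."""
    judgment = initial_judgment
    history = []
    for s_k, c_k in zip(scale_values, change_params):
        judgment = judgment + c_k * (s_k - judgment)
        history.append(judgment)
    return history


# Hypothetical scale values for three serially presented items.
items = [2.0, 4.0, 6.0]

# With C_k = 1/k every item ends up equally weighted (a running mean).
equal_weight = proportional_change(0.0, items, [1.0, 1 / 2, 1 / 3])

# A constant C_k gives later items more influence (a recency pattern).
recency = proportional_change(0.0, items, [0.5, 0.5, 0.5])

print(equal_weight)
print(recency)
```

Under these assumed parameters the first series tracks the running mean of the items, whereas the constant change parameter leaves the final judgment closer to the last item, which is the kind of pattern a serial position curve is meant to reveal.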

The Bayesian Approach 

Brunswik proposed the use of correlations to assess relationships in a
probabilistic environment. He could have used conditional probabilities
instead; had he done so, he undoubtedly would have built his lens model
around Bayes' theorem, an elementary fact about probabilities described in
1763 by the Reverend Thomas Bayes. The modern impetus for what we are
calling the Bayesian paradigm can be traced to the work of von Neumann and
Morgenstern (1947), who revived interest in maximization of expected utility
as a core principle of rational decision making, and to L. J. Savage,
whose book The Foundations of Statistics fused the concepts of utility and
personal probability into an axiomatized theory of decision in the face of
uncertainty, "a highly idealized theory of the behavior of a 'rational'
person with respect to decisions" (Savage, 1954, p. 7).

The Bayesian approach was communicated to the world of business and 
economics by Schlaifer (1959). Psychologists were introduced to Bayesian 
notions by Ward Edwards (Edwards, 1962; Edwards, Lindman, & Savage, 1963)
and much of the empirical work to be discussed was stimulated directly by 
the ideas in these two papers. 

The Bayesian approach is thoroughly embedded within the framework of
decision theory. Its basic tenets are that probability is orderly opinion
and that the optimal revision of such opinion, in the light of relevant
new information, is accomplished via Bayes' theorem. Edwards (1966)
noted that, although revision of opinion can be studied as a separate
phenomenon, it is most interesting and important when it leads to decision
making and action. The output of a Bayesian analysis is not a single
prediction but rather a distribution of probabilities over a set of hypothesized
states of the world. These probabilities can then be used, in combination
with information about payoffs associated with various decision possibilities
and states of the world, to implement any of a number of decision rules,
including the maximization of expected value or expected utility.

Bayes' theorem is thus a normative model. It specifies certain internally
consistent relationships among probabilistic opinions and serves to prescribe,
in this sense, how men should think. Much of the psychological research
has used Bayes' theorem as a standard against which to compare actual
behavior and to search for systematic deviations from optimality.

The Bayesian model. Given several mutually exclusive and exhaustive
hypotheses, $H_i$, and a datum, $D$, Bayes' theorem states that:

$$P(H_i/D) = \frac{P(D/H_i)\,P(H_i)}{P(D)}. \tag{7}$$

In Equation (7), $P(H_i/D)$ is the posterior probability that $H_i$ is true,
taking into account the new datum, $D$, as well as all previous data. $P(D/H_i)$
is the conditional probability that the datum $D$ would be observed if
hypothesis $H_i$ were true. For a set of mutually exclusive and exhaustive
hypotheses $H_i$, the values of $P(D/H_i)$ represent the impact of the datum $D$ on
each of the hypotheses. The value $P(H_i)$ is the prior probability of
hypothesis $H_i$. It, too, is a conditional probability, representing the
probability of $H_i$ conditional on all information available prior to the
receipt of $D$. $P(D)$, the probability of the datum, serves as a normalizing
constant, and is equal to $\sum_i P(D/H_i)\,P(H_i)$. Although Equation (7) is
appropriate for discrete hypotheses, it can be rewritten, using integrals, to
handle a continuous set of hypotheses and continuously varying data
(Edwards, 1966).

It is often convenient to form the ratio of Equation (7) taken with
respect to two hypotheses, $H_i$ and $H_j$:

$$\frac{P(H_i/D)}{P(H_j/D)} = \frac{P(D/H_i)}{P(D/H_j)} \cdot \frac{P(H_i)}{P(H_j)}.$$

For this ratio form, new symbols are introduced:

$$\Omega_1 = LR_D \cdot \Omega_0;$$

the posterior odds,

$$\Omega_1 = \frac{P(H_i/D)}{P(H_j/D)},$$

are equal to the product of the likelihood ratio of the datum,

$$LR_D = \frac{P(D/H_i)}{P(D/H_j)},$$

times the prior odds,

$$\Omega_0 = \frac{P(H_i)}{P(H_j)}.$$
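
As a concrete illustration of Equation (7) and its ratio form, the following Python sketch computes posterior probabilities for a discrete hypothesis set and confirms that the posterior odds equal the likelihood ratio times the prior odds. The function name and the numerical priors and likelihoods are hypothetical, introduced only for this example.

```python
def posterior_probabilities(priors, likelihoods):
    """Bayes' theorem for a discrete, mutually exclusive and exhaustive hypothesis set.

    priors[i] is P(H_i); likelihoods[i] is P(D/H_i) for the observed datum D.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_d = sum(joint)  # P(D), the normalizing constant
    return [x / p_d for x in joint]


# Hypothetical two-hypothesis example (numbers chosen for illustration).
priors = [0.6, 0.4]
likelihoods = [0.2, 0.5]

post = posterior_probabilities(priors, likelihoods)

prior_odds = priors[0] / priors[1]
likelihood_ratio = likelihoods[0] / likelihoods[1]
posterior_odds = post[0] / post[1]

print(post)
print(posterior_odds, likelihood_ratio * prior_odds)
```

The normalizing constant cancels when the ratio is formed, which is why the odds form needs only the likelihood ratio and the prior odds.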



Bayes' theorem can be used sequentially to measure the impact of several
data. The posterior probability computed for just the first datum is used
as the prior probability when processing the impact of the second datum, and
so on. The order in which data are processed makes no difference to their
impact on posterior opinion. For $n$ data, $D_k$, the final posterior odds,
given all the data, are

$$\Omega_n = \prod_{k=1}^{n} LR_k \cdot \Omega_0. \tag{8}$$

Equation (8) shows that data affect the final odds multiplicatively. If
the logarithm of this equation were taken, the log likelihood ratios would
combine additively with the log prior odds.

Although the ratio form of Bayes' theorem summarizes all the information
only in the case of two hypotheses, Bayes' theorem can be used with any
number of hypotheses, in which case one ends with a set of posterior odds
that can be translated into a distribution of posterior probabilities across
all the hypotheses.
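
Two properties of Equation (8) can be verified directly: the final odds do not depend on the order of the data, and in logarithmic form the likelihood ratios combine additively. The Python sketch below is a minimal check under assumed values; the prior odds and likelihood ratios are hypothetical, and base-10 logarithms are used only as a convenient choice.

```python
import math
from itertools import permutations


def final_odds(prior_odds, likelihood_ratios):
    """Equation (8): multiply the prior odds by the likelihood ratio of each datum."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds


def final_log_odds(prior_odds, likelihood_ratios):
    """Log form: log likelihood ratios add to the log prior odds."""
    return math.log10(prior_odds) + sum(math.log10(lr) for lr in likelihood_ratios)


# Hypothetical prior odds and likelihood ratios for three data.
prior = 0.5
lrs = [3.0, 0.8, 2.5]

# Every ordering of the three data yields the same final odds.
odds_by_order = [final_odds(prior, order) for order in permutations(lrs)]

print(odds_by_order)
print(10 ** final_log_odds(prior, lrs))
```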

The use of Bayes' theorem assumes that data are conditionally
independent, i.e.,

$$P(D_j/H_i) = P(D_j/H_i, D_k).$$

If this assumption is not met, then the combination rule has to be expanded.
For two data, the expanded version is:

$$P(H_i/D_j, D_k) = \frac{P(D_j/H_i, D_k)\,P(D_k/H_i)\,P(H_i)}{\text{normalizing constant}}. \tag{9}$$

As more data are received, the equation requires further expansion and
becomes difficult to implement.

The meaning of the conditional independence assumption might be clarified
by some examples. Height and hair length are negatively correlated, and
thus non-independent, in the adult U.S. population (even these days), but
within subgroups of males and females, height and hair length are, we might
suppose, quite unrelated. Thus if the hypothesis of interest is the
identification of a person as male or female, height and hair length data are
conditionally independent, and the use of Bayes' theorem to combine these
cues is appropriate. In contrast, height and weight are related both across
sexes and within sexes, and are thus both unconditionally and conditionally
non-independent. The evidence from these cues could not be combined via
Bayes' theorem without altering it as shown in Equation (9). One way of
thinking about the difference between these two examples is that in the first
case the correlation between the cues is mediated by the hypothesis: the
person is tall and has short hair because he is male. In the case of
conditional non-independence, however, the correlation between the cues is
mediated by something other than the hypothesis: the taller person tends
to weigh more because of the structural properties of human bodies.
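
The height and hair length example can be made concrete with a small exact calculation. In the Python sketch below every probability is a hypothetical value invented for illustration, not an estimate from data; the cues are constructed to be independent within each sex, and the computation shows that they are nonetheless related in the combined population.

```python
# Hypothetical probabilities for illustration only (not estimates from any data).
p_h = {"male": 0.5, "female": 0.5}
p_tall = {"male": 0.8, "female": 0.2}        # P(tall / H)
p_short_hair = {"male": 0.9, "female": 0.2}  # P(short hair / H)

# Within each hypothesis the two cues are independent by construction, so the
# joint probability is the product of the conditional probabilities.
p_both = sum(p_h[h] * p_tall[h] * p_short_hair[h] for h in p_h)
p_tall_marginal = sum(p_h[h] * p_tall[h] for h in p_h)
p_short_marginal = sum(p_h[h] * p_short_hair[h] for h in p_h)

# Unconditionally the cues are related: the joint probability differs from the
# product of the marginals.
marginal_product = p_tall_marginal * p_short_marginal

# Because conditional independence holds, Bayes' theorem can combine the cues
# by multiplying their separate likelihoods.
joint_likelihood = {h: p_tall[h] * p_short_hair[h] for h in p_h}
normalizer = sum(p_h[h] * joint_likelihood[h] for h in p_h)
p_male_given_both = p_h["male"] * joint_likelihood["male"] / normalizer

print(p_both, marginal_product)
print(p_male_given_both)
```

For cues such as height and weight, the product of separate likelihoods would no longer equal the joint likelihood within each sex, and the expanded rule of Equation (9) would be required.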

Experimental paradigms. A hypothetical experiment, similar to one
actually performed by Phillips and Edwards (1966), will illustrate a common
use of the Bayesian model. The subject is presented with the following
situation: Two bookbags are filled with poker chips. One bookbag has 70
red chips and 30 blue chips. The other bag holds 30 red chips and 70 blue
chips, but the subject does not know which bag is which. The experimenter
flips a coin to choose one of the bags. He then begins to draw chips from
the chosen bag. After drawing a chip he shows it to the subject and then
replaces it in the bag, stirring vigorously before drawing the next chip.

The subject has in front of him a device for recording his responses:
two upright rods and on them, 100 washers. Behind the rods is a board
calibrated so that the subject can easily tell how many washers are on each
rod. One rod is labeled "70% Red," the other, "70% Blue."

The subject is asked to use the washers to express his opinion as to
the probability that the predominantly red or predominantly blue bag is
the one being sampled from. When the subject puts 75 washers on the "70%
Red" rod and 25 on the "70% Blue" rod he is indicating his opinion that the
chances are 75 in 100 that the predominantly red bag was the one chosen.

At the start, before the first chip is drawn, the subject is required to
place 50 washers on each rod, indicating that each bag is equally likely to
have been chosen. Then, after each chip is drawn, the subject reflects
the revision of his opinion by moving from one rod to the other as many
washers as he wishes. The subject sees 10 successive chips drawn; the basic
information for the data analysis is the 10 responses the subject made
after each chip.

The optimal responses are computed from Bayes' theorem. The data
(poker chips) are conditionally independent because each sampled chip is
replaced before the next is drawn. The prior odds are 1 (the bookbags were
equally likely to be chosen), and the likelihood ratios associated with red
and blue chips are a function of the 70/30 proportions in each urn:

$$LR_{\text{Red Chip}} = \frac{P(\text{Red Chip}/H_{70\%\,\text{Red}})}{P(\text{Red Chip}/H_{70\%\,\text{Blue}})} = \frac{.7}{.3};$$

$$LR_{\text{Blue Chip}} = \frac{P(\text{Blue Chip}/H_{70\%\,\text{Red}})}{P(\text{Blue Chip}/H_{70\%\,\text{Blue}})} = \frac{.3}{.7}.$$

Thus the posterior odds of the predominantly red urn having been chosen, given
a sample of, say, 6 red chips and 4 blue chips, are:

$$\Omega_{10} = \left(\frac{.7}{.3}\right)^{6} \left(\frac{.3}{.7}\right)^{4} (1) = \left(\frac{.7}{.3}\right)^{2} = 5.44.$$

The odds are greater than 5 to 1 that the predominantly red urn is the urn
being sampled. This corresponds to a posterior probability for that urn of
approximately .845.
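
The figures quoted for the 6-red, 4-blue sample can be reproduced exactly. The short Python sketch below uses exact fractions; the function names are hypothetical, and the only inputs are the 70/30 chip proportions and the prior odds of 1 described in the text.

```python
from fractions import Fraction


def posterior_odds_red_bag(n_red, n_blue, prior_odds=Fraction(1)):
    """Posterior odds favoring the 70%-red bag after n_red red and n_blue blue chips."""
    lr_red = Fraction(7, 10) / Fraction(3, 10)   # P(red/70% red) / P(red/70% blue)
    lr_blue = Fraction(3, 10) / Fraction(7, 10)  # P(blue/70% red) / P(blue/70% blue)
    return prior_odds * lr_red ** n_red * lr_blue ** n_blue


def odds_to_probability(odds):
    return odds / (1 + odds)


odds = posterior_odds_red_bag(6, 4)
probability = odds_to_probability(odds)

print(odds, float(odds))                # 49/9, about 5.44
print(probability, float(probability))  # 49/58, about 0.845
```

Because the red and blue likelihood ratios are reciprocals, only the difference between the number of red and blue chips matters in this symmetric task.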

The primary data analysis compares subjects' probability revisions upon
receipt of each chip with those of Bayes' theorem. To supplement direct
comparisons of Bayesian probabilities and subjective estimates, Peterson,
Schneider, and Miller (1965) introduced a measure of the degree to which
performance is optimal, called the accuracy ratio:

$$AR = \frac{SLLR}{BLLR},$$

where SLLR is the log likelihood ratio inferred from the subjects' probability
estimates and BLLR is the optimal (Bayesian) log likelihood ratio. When log
likelihood ratios are used, the optimal responses become linear with the amount
of evidence favoring one hypothesis over the other. The AR can be viewed
as the slope of the best fitting line on a plot of the log of subjects'
responses against the log of optimal responses. The AR will be equal to 1
when the subject revises optimally, will be greater than 1 when the subject's
revisions imply greater certainty about the truth of one hypothesis than is
justified by the data, and less than 1 when the reverse is true. In the
rare case when a subject treats a datum as if it pointed to one hypothesis
while the datum in fact pointed to the other, the AR would be negative.
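
The behavior of the accuracy ratio in these four cases can be sketched for a single datum. In the Python example below the subjective log likelihood ratio is inferred as the change in the subject's log odds from one response to the next; that inference rule, the function names, and the subjects' probability estimates are assumptions of this sketch rather than details taken from Peterson, Schneider, and Miller (1965).

```python
import math


def log_odds(probability):
    return math.log10(probability / (1.0 - probability))


def accuracy_ratio(p_before, p_after, bayesian_lr):
    """AR = SLLR / BLLR for a single datum.

    SLLR is inferred here as the change in the subject's log odds, which is an
    assumption of this sketch about how the inference is carried out.
    """
    sllr = log_odds(p_after) - log_odds(p_before)
    bllr = math.log10(bayesian_lr)
    return sllr / bllr


lr_red = 0.7 / 0.3  # likelihood ratio of a red chip in the bookbag task

optimal = accuracy_ratio(0.50, 0.70, lr_red)       # revises exactly as Bayes' theorem does
conservative = accuracy_ratio(0.50, 0.60, lr_red)  # revises too little
excessive = accuracy_ratio(0.50, 0.85, lr_red)     # revises too much
reversed_ = accuracy_ratio(0.50, 0.40, lr_red)     # revises in the wrong direction

print(optimal, conservative, excessive, reversed_)
```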

The task just described illustrates the use of a binomial data generating
model. The Bayesian paradigm, however, is capable of dealing with a
great variety of different types of data (discrete or continuous, from the
same or different sources, etc.).

Other Bayesian experiments have employed multinomial distributions to
generate samples of data. Table 1 provides a hypothetical illustration.

Table 1

Some Multinomial Data Generating Hypotheses

                                                Hypotheses About a Student's GPA
                                                H_1          H_2          H_3
Data Class             Subclasses               Lower 33%    Middle 33%   Upper 33%

D_1  Verbal            1. Below average            .55          .30          .15
     Ability           2. Average                  .30          .40          .35
                       3. Above average            .15          .30          .50

D_2  Achievement       1. At or below average      .75          .50          .50
     Motivation        2. Above average            .25          .50          .50

D_3  Credit Hours      1. Below 12                 .15          .25          .20
     Attempted         2. 12 - 15                  .25          .30          .20
                       3. 15 - 18                  .30          .30          .30
                       4. Above 18                 .30          .15          .30

Note. Cell entries are P(D_jk/H_i) values.


In this example three hypotheses concerning a college student's grade point
average (GPA) are related to three data sources (e.g., verbal ability,
achievement motivation, and credit hours attempted). Each data source is
comprised of several subclasses of information (e.g., below average or above
average achievement motivation). The entries in the cells of the resulting
evidence-hypothesis matrix are conditional probabilities of the form $P(D_{jk}/H_i)$,
i.e., the probability that the $k$th subclass of data class $j$ would occur,
given $H_i$.

If the data subclasses are mutually exclusive and exhaustive, as is
the case here, the conditional probabilities within any data source and any
one hypothesis must sum to 1.00 (e.g., a student must be either above, at,
or below average on achievement motivation). Across hypotheses, the conditional
probabilities need not sum to any constant (e.g., relatively few college
students, regardless of GPA, take less than 12 credit hours of course work).

The critical measure of relatedness between a cue and a hypothesis is
represented here by three conditional probabilities, $P(D_{jk}/H_1)$, $P(D_{jk}/H_2)$,
and $P(D_{jk}/H_3)$, rather than by a single correlation. The diagnosticity of
a particular datum, $D_{jk}$, rests on the ratios of the conditional probabilities
across hypotheses. Thus below average verbal ability ($D_{11}$) is highly
diagnostic, whereas 15-18 credit hours attempted ($D_{33}$) gives no information
at all concerning GPA.
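
The contrast between a diagnostic and a nondiagnostic datum can be computed from the two relevant rows of Table 1. In the Python sketch below the conditional probabilities are copied from the table, while the equal prior probabilities over the three GPA hypotheses and the function names are assumptions made only for illustration.

```python
# Conditional probabilities P(D_jk/H_i) copied from two rows of Table 1,
# ordered as (H_1 lower third, H_2 middle third, H_3 upper third).
below_average_verbal = [0.55, 0.30, 0.15]   # D_11
credit_hours_15_to_18 = [0.30, 0.30, 0.30]  # D_33


def likelihood_ratios(cond_probs):
    """Likelihood ratio of the datum for every ordered pair of hypotheses."""
    n = len(cond_probs)
    return {
        (i + 1, j + 1): cond_probs[i] / cond_probs[j]
        for i in range(n)
        for j in range(n)
        if i != j
    }


def posterior(priors, cond_probs):
    joint = [p * c for p, c in zip(priors, cond_probs)]
    total = sum(joint)
    return [x / total for x in joint]


# Equal priors over the three GPA hypotheses are an assumption of this sketch.
priors = [1 / 3, 1 / 3, 1 / 3]

verbal_lrs = likelihood_ratios(below_average_verbal)
credit_lrs = likelihood_ratios(credit_hours_15_to_18)
post_verbal = posterior(priors, below_average_verbal)
post_credit = posterior(priors, credit_hours_15_to_18)

print(verbal_lrs[(1, 3)])  # H_1 against H_3 for below average verbal ability
print(post_verbal)
print(post_credit)
```

Every likelihood ratio for the 15-18 credit hour datum equals 1, so the posterior distribution is identical to the prior, whereas below average verbal ability shifts opinion substantially toward the lower third.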

Table 1 may be used to generate data sequences (hypothetical students) 
whose GPA classification is to be predicted by subjects.

Schum (1966b, p. 35) describes the experimental paradigm, based on
a multinomial task, for studying information processing:

"Samples of evidence, . . . , form the basic input
to subjects. In experimental inference situations
subjects are asked to assume some prior probability
distribution across the hypothesis set. Then subjects
aggregate and process the samples of evidence in
order to revise their prior opinions about the
likelihood that the various hypotheses had generated
the evidence. In order to do this they need
indications of the impact of each item of evidence
[i.e., $P(D_{jk}/H_i)$]. The experimental paradigm described
above can rest upon an objective probability base
in which an objective $P(D_{jk}/H_i)$ matrix has been
prescribed by the experimenter. Such a matrix can
be given directly to subjects in which case their
only task is to aggregate items of evidence with
prescribed impact. In another case, subjects might
be required to estimate the multinomial $P(D_{jk}/H_i)$
distributions under each hypothesis from relative
frequencies of occurrence of the various subclasses.
Subsequent knowledge of which hypothesis in fact
generated each sample of evidence is necessary in
this case so that the evidence-hypothesis relationships
or diagnostic impact of each item of evidence
can be learned by the subject. In either case,
subjective probability revisions (on the basis of
evidence) are in the form of posterior probabilities
[$P(H_i/D)$] or some analog such as posterior odds.
The direction and size of these revisions can be
compared with the theoretical revisions prescribed
by Bayes' theorem."

Information seeking experiments. The decision maker often has the
option of deferring his decision while he gathers relevant information,
usually at some additional cost. The information presumably will increase
his certainty about the true state of the world and increase his chances
for making a good decision. In seeking additional information, the decision
maker must weigh the relative advantage of the information to be purchased
against its cost. When the probabilistic characteristics of the task are
well specified, an optimal strategy can be specified that will, in conjunction
with the reward for making a correct decision, the penalty for being
wrong, and the cost of the information, specify a stopping point that will
be optimal in the sense of maximizing expected value (Edwards, 1965;
Raiffa & Schlaifer, 1961; Wald, 1947). This task is a natural extension of
the probabilistic inference tasks described above inasmuch as it requires
the decision maker to link payoff considerations with his inferences in
order to arrive at a decision. A large number of studies have investigated
man's ability to make such decisions. For example, one commonly studied
task uses the bookbag and poker chip problem described earlier. As
before, a sequence of chips is sampled, with replacement, from a bag with
proportion of red chips equal to $P_1$ or $P_2$. Instead of estimating the
posterior probabilities for each bag, the subject must decide from which bag
the sample is coming. In some cases, he must decide, prior to seeing the
first chip, how many chips he wishes to see (fixed stopping). In other
cases, he samples one chip at a time and can stop at any point and announce
his decision (optional stopping). Space limitations prohibit further analysis
of this body of research here. The interested reader is referred to papers
by Fried and Peterson (1969), Pitz (1969b, c), Rapoport (1969), and Wallsten
(1968) for examples of this and related research.

Comparisons of the Bayesian and Regression Approaches 

Having completed our overview of the basic elements of the regression
and Bayesian approaches, it is appropriate to consider briefly some of the
similarities and differences between them. At first glance, it would seem
that the dissimilarities predominate. This impression is fostered, primarily,
by the grossly different terminology used within each approach. However,
closer examination reveals many points of isomorphism.

Points of Similarity and Difference 

First and foremost, each paradigm is based on a theoretical model of
the process whereby information is integrated into a judgment or decision.
Furthermore, these various models are closely related. The simple
linear model plays a key role in both correlational and functional measurement
studies. Bayes' theorem is also a linear model under a logarithmic
transformation (i.e., Equation (8) translates into
$\log \Omega_n = \sum_{k=1}^{n} \log LR_k + \log \Omega_0$). In addition, the
proportional change model of impression formation
(Equation 6) is Bayesian in spirit since it conceptualizes the step-by-step
buildup of judgment in terms of a weighted combination of the present datum
and prior response. Each of these models contains analogous descriptive
parameters for the purpose of assessing the relevance of data dimensions or
data items for the judge. Thus correlational studies describe correlations
($r_{i,s}$), regression weights ($b_{i,s}$), or relative weights ($RW_{i,s}$); ANOVA studies
estimate $\omega^2_{i,s}$ values; functional measurement models produce estimates of $w_k$; and
Bayesians infer subjective likelihood ratios. Despite the different
terminology, the similarity of purpose is marked.

Both the lens model and the Bayesian approach share a deep concern about
the relationship between the decision maker and his environment. Both
models compare what the decision maker does with what he should be doing.
However, the meaning of optimality differs in the two models. The optimality
of multiple regression rests on the acceptance of a built-in payoff function:
the least squares criterion of goodness-of-fit. The Bayesian model is
optimal in a different sense: an idealized rational decision maker can be
satisfied that he is logically consistent. The resulting posterior
probabilities can be combined with any payoff function to determine the
best action. In certain circumstances, Bayesian and multiple regression
models lead to identical solutions, as in the case of determining an
optimal decision boundary between two hypotheses on the basis of normally
distributed and standardized data (Koford & Groner, 1966). In contrast to the
lens and Bayesian models, the functional measurement approach and the stream
of correlational research exemplified by Hoffman's work differ by being
exclusively descriptive in intent.

The data that serve as input to the decision maker vary somewhat both
within and across each approach. The correlational paradigm typically
involves dimensions of data, usually monotonically related to the criterion.
Data processed within the functional measurement and Bayesian studies, by
contrast, are typically discrete, categorical particles, although these
approaches can also process dimensional data. As to the relationships among
data sources, functional measurement requires factorially combined data
elements and workers within the descriptive stream of correlational research
also prefer orthogonal structure. Lens model research often uses data that
are correlated in a fashion representative of the real world. Bayes'
theorem, however, requires conditionally independent data. A rough translation
of this requirement in correlational terms would be that the correlations
between cues, with the criterion dimension partialled out, must
be zero.

The response required of the subject also differs across paradigms.
The correlational and functional measurement approaches usually deal with
a single-valued prediction (point estimate) about some conceptually continuous
hypothesis. Bayesians would say that there is a probability distribution
over this continuous distribution and that the subject's single judgment
must represent the output of some covert decision process in which some
implicit decision rule is applied (for example, the response may be
interpreted as specifying the criterion value having the largest probability of
occurrence), based on some implicit payoff matrix. Some Bayesian studies
also require subjects to make predictions, usually concerning discrete
hypotheses. When they do, the payoffs accompanying correct and incorrect
predictions are usually made explicit to the decision maker. Most often,
however, subjects in Bayesian studies estimate the posterior or conditional
probability distribution (or some function thereof, such as odds) across
various hypotheses. Although Bayes' theorem can, in principle, be applied
to continuous hypotheses, the emphasis on probability distributions rather
than point estimates makes such a task experimentally awkward (see, however,
Peterson & Phillips, 1966).

The Bayesian paradigm looks at fixed hypotheses and examines the manner
in which their subjective likelihood is revised in the light of new information
about the world. For this reason it has been called a "dynamic" paradigm.
In contrast, most of the correlational research deals with "static" aspects
of information processing: when a subjective weight is inferred from a
subject's responses over 50 trials, it is assumed that the subject's view
of the world is unchanged over this period. However, the static vs. dynamic
distinction is not inherent in the models. A good example of this point is
illustrated within the functional measurement paradigm where the weighted
average model takes both a static form (Equation 4) and a dynamic form,
the "proportional change model" of Equation 6. In like manner, a regression
equation can handle information sequentially and the item-by-item revision
of judgment can be compared to optimal revisions specified by the equation.

Testing the Models 

Although the models are similar, the attitudes of researchers towards
testing them differ somewhat. Workers within correlational and Bayesian
settings have typically been satisfied that high correlations between their
model's predictions and the subject's responses provide adequate evidence
for the validity of the model. For example, Beach (1966) observed correlations
in the .90's between subjects' P(H/D) estimates and Bayesian values
that were calculated using subjects' earlier P(D/H) estimates. He concluded
that:

". . . Ss possess a rule for revising subjective probabilities
that they apply to whatever subjective probabilities they
have at the moment.

"As has been amply demonstrated, the Ss' revision rule
is essentially Bayes' theorem. That is to say, Ss'
revisions can be predicted with a good deal of precision
using Bayes' theorem as the model" (Beach, 1966, p. 36).

However, Anderson, working within the functional measurement paradigm,
has chided other researchers for neglecting to test goodness of fit:

"Tests of qualitative predictions clearly require
evaluation of the discrepancies from prediction. Much
of the earlier work . . . is unsatisfactory in this
regard since it is based on regression statistics and
goes no further than reporting correlations between
predicted and observed." (Anderson, 1969, p. 64).

Anderson goes on to note that high correlations may occur despite a seriously
incorrect model. As evidence of this he cites a study by Sidowski and
Anderson (1967) which found a correlation of .986 between the data and a
simple additive model despite the fact that the ANOVA showed a statistically
significant and substantively meaningful interaction.
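
Anderson's point, that a high correlation can coexist with a systematic departure from additivity, can be illustrated with a small constructed example. The Python sketch below is not a reanalysis of Sidowski and Anderson (1967); the scale values and the size of the multiplicative term are hypothetical, and because the table contains no error term the sketch shows only the size of the interaction, not its statistical significance.

```python
import math

levels = [1.0, 2.0, 3.0, 4.0]  # hypothetical scale values for rows and columns

# Hypothetical "observed" responses containing a multiplicative (configural) term.
observed = [[s + t + 0.1 * s * t for t in levels] for s in levels]


def mean(values):
    return sum(values) / len(values)


row_means = [mean(row) for row in observed]
col_means = [mean([row[j] for row in observed]) for j in range(len(levels))]
grand = mean(row_means)

# Best additive description: row effect plus column effect.
additive = [
    [row_means[i] + col_means[j] - grand for j in range(len(levels))]
    for i in range(len(levels))
]


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


flat_obs = [x for row in observed for x in row]
flat_add = [x for row in additive for x in row]

r = pearson_r(flat_obs, flat_add)
largest_discrepancy = max(abs(o - a) for o, a in zip(flat_obs, flat_add))

print(r)
print(largest_discrepancy)
```

Under these assumed values the additive predictions correlate above .99 with the responses even though every cell departs from additivity in a patterned way, which is exactly the kind of discrepancy a correlation alone would hide.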

The Dilemma of Paramorphic Representation 

The mathematical models we have been discussing serve, at the very least, 
to provide a quantified, overall, descriptive summary of the manner in which 
information is weighted and combined. To what extent do they represent 
the actual cognitive operations performed by the judge? Hoffman (1960, 1968) 
raised a problem particularly germane to this question. He observed that: 
a) two or more models of judgment may be algebraically equivalent yet sugges- 
tive of radically different underlying processes; and b) two or more models 
may be algebraically different yet equally predictive, given fallible data. 
Drawing an analogy to problems of classification in mineralogy, Hoffman
introduced the term "paramorphic representation" to remind us that "the
mathematical description of judgment is inevitably incomplete . . . ,
and it is not known how completely or how accurately the underlying process
has been represented" (Hoffman, 1960, p. 125). Although Hoffman raised the
paramorphic problem in connection with models based upon correlational
techniques, Bayesian models also face this dilemma.

Empirical Research

As the preceding discussion indicated, judgment researchers are studying
similar phenomena but with somewhat different methods. In the remainder 
of this chapter we shall survey the empirical research spawned by the theory 
and methodology described above. Table 2 outlines the organization of our 
coverage. We have partitioned regression studies according to whether 
they were conducted within the correlational or functional measurement 
paradigms. We have further categorized the work according to five broad 
problem areas relating to the use of information by the decision maker. 



Insert Table 2 about here 



The first category is devoted to a focal topic of research within each 
paradigm. For the correlational paradigm, this focal topic is the specifi- 
cation of the policy equation for the judge, including the closely related 
problem of whether to include non-linear terms in the policy equation. The 
focal topic of functional measurement is the distinction between two variants 
of the linear model, the summation model and the averaging model, in impression 
formation. In Bayesian research, the focal topic is a particular form of 
sub-optimal performance called conservatism. These topics are not closely 
inter-related. They are emphasized here simply because they have received 
so much attention in the three research areas. 

The second research category is devoted to the task determinants 
of information use. While many of these task variables are similar across 
differing paradigms, the dependent variables of such studies are less 
comparable, because they are so often closely related to the focal topics 
of the paradigms. For example, consider the task variable of number of
items of information. In the correlational paradigm, Einhorn (in press) has
shown a decrease in $r_s$, subjects' consistency, as the number of cues increased.
Anderson (1965a) used varying set sizes in a functional measurement paradigm
to test predictions of his averaging model. In a Bayesian setting, Peterson,
DuCharme, and Edwards (1968) have shown that larger sample sizes yield
greater conservatism. Because task variables such as these are so closely
linked to the focal topic in each area, we will report each group of studies 
directly after the relevant focal topic. We will restrict our coverage 
to what are primarily performance studies — e.g., studies in which the 
judge either has learned the relevant characteristics about the information 
he is to use prior to entering the experiment, or, alternatively, is given 
this information at the start. In other words, this research is concerned 
with evaluating how the judge uses the information he has and not with how 
he learns to use this information. 

Additional research categories are devoted to sequential determinants 
of information use, learning to use information, strategies for combining 
information, and techniques for aiding the decision maker. 

Focal Topic of Correlational and ANOVA Research: 

Modeling a Judge's Policy 

The Linear Model 

A large number of studies have attempted to represent the judge's 
idiosyncratic weighting policy by means of the linear model (Equation 2). 
Examination of more than thirty of these studies illustrates the tremendous 
diversity of judgmental tasks to which the model has been applied. The 
tasks include judgments about personality characteristics (Hammond, Hursch, &
Todd, 1964; Knox & Hoffman, 1962); performance in college (Dawes, 1970;
Einhorn, in press; Newton, 1965; Sarbin, 1942) or on the job (Madden, 1963;
Naylor & Wherry, 1965); attractiveness of common stocks (Slovic, 1969) and
other types of gambles (Slovic & Lichtenstein, 1968); physical and mental
pathology (Goldberg, 1970; Hoffman, Slovic, & Rorer, 1968; Oskamp, 1962;
Wiggins & Hoffman, 1968); and legal matters (Kort, 1968; Ulmer, 1969).

In some cases, the stimuli are artificial and the judges are unfamiliar 
with the task. Typical of these is a study by Knox and Hoffman (1962), who
had college students judge the intelligence of other students on the basis 
of grade point average, aptitude test scores, credit hours attempted, etc., 
and a study by Summers (1968), who had students rate the potential for 
achieving minority group equality as a function of legislated opportunities 
and educational opportunities. At the other extreme are studies of judgments 
made in complex but familiar situations by skilled decision makers who had 
other cues available besides those included in the prediction equation. For 
example, Kort (1968) modeled judicial decisions in workmen’s compensation 
cases using various facts from the cases as binary cues. Brown (1970) modeled 
caseworkers' suicide probability estimates for persons phoning a metropolitan
suicide prevention center. The cues were variables such as sex, age, 
suicide plan, etc., obtained from the telephone interview. Another example 
is the work of Dawes (1970) who built a linear model to predict the ratings 
given applicants for graduate school by members of the admissions committee. 

In all of these situations the linear model has done a fairly good job
of predicting the judgments, as indicated by r_s values in the .80's and .90's
for the artificial tasks and the .70's for the more complex real-world
situations. Most of these models were not cross-validated. However, in the
few studies that have applied the linear model derived from one sample of
judgments to predict a second sample, there has been remarkably little
shrinkage, usually only a few points (Einhorn, in press; Slovic & Lichtenstein,
1968; Summers & Stewart, 1968; Wiggins & Hoffman, 1968).
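
The derivation and cross-validation procedure described above can be sketched
as follows. This is only an illustration under stated assumptions: the judge,
the three cues, the weights (0.6, 0.3, 0.1), and the error level are all
hypothetical and are not taken from any of the studies cited.

```python
import numpy as np

def fit_linear_policy(cues, judgments):
    """Least-squares weights (intercept first) for a judge's linear policy."""
    design = np.column_stack([np.ones(len(cues)), cues])
    weights, *_ = np.linalg.lstsq(design, judgments, rcond=None)
    return weights

def predictability(weights, cues, judgments):
    """Correlation between model predictions and the judge's responses (r_s)."""
    design = np.column_stack([np.ones(len(cues)), cues])
    return float(np.corrcoef(design @ weights, judgments)[0, 1])

# Hypothetical judge: weights three cues 0.6, 0.3, 0.1 and adds random error.
rng = np.random.default_rng(0)
cues = rng.normal(size=(200, 3))
judgments = cues @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.3, size=200)

# Derive the policy on the first 100 judgments, then apply the same
# weights to the remaining 100 judgments (the cross-validation sample).
weights = fit_linear_policy(cues[:100], judgments[:100])
r_derivation = predictability(weights, cues[:100], judgments[:100])
r_cross = predictability(weights, cues[100:], judgments[100:])
shrinkage = r_derivation - r_cross
```

The quantity `shrinkage` is the drop in r_s when the derived weights are
applied to new judgments; for a judge who really is linear plus error, it
should be small.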

Capturing a Judge's Policy

The various dimensions of a stimulus object are certainly not equally 
important and judges do not weight them equally. One of the purposes of 
using the linear model to represent the judgment process is to make the 
judge's weighting policy explicit.

Large individual differences among weighting policies have been found 
in almost every study that reports individual equations. For example, Rorer,
Hoffman, Dickman, and Slovic (1967) examined the policies whereby hospital
personnel granted weekend passes to patients at a mental hospital. They
noted that: "For five of the six items (cues) there was at least one judge
for whom it was the most important and at least one for whom it was
nonsignificant" (p. 196). A striking example of individual differences in a task
demanding a high level of expertise comes from a study of nine radiologists
by Hoffman, Slovic, and Rorer (1968). The stimuli were hypothetical ulcers,
described by the presence or absence of seven roentgenological signs. Each
ulcer was rated according to its likelihood of being malignant. There was
considerable disagreement among radiologists' judgments as indicated by a
median interjudge correlation, across stimuli, of only .38. A factor analysis
of these correlations disclosed four different categories of judges, each of
which was associated with a particular kind of policy equation.

Even when expert judges don't disagree with one another, an attempt to
model them can be enlightening. For example, seven of the nine radiologists
studied by Hoffman et al. viewed small ulcer craters as more likely to be
malignant than large craters. Yet a follow-up study by Slovic, Rorer, and
Hoffman (in press) describes statistical evidence obtained by other researchers
indicating just the opposite: that large craters are more likely than small
ones to be malignant. In a similar fashion, Hammond, Hursch, and Todd
(1964) reanalyzed data obtained by Grebstein (1963) in which clinical
psychologists with varying levels of experience judged IQ on the basis of
Rorschach signs. By examining policy equations they were able to trace
the performance decrement shown by inexperienced judges to their misuse of
one particular sign.

The ability of regression equations to describe individual differences 
in judgment policies has led to the development of a number of techniques for 
grouping or clustering judges in terms of the homogeneity of their equations
(Christal, 1963; Maguire & Glass, 1968; Naylor & Wherry, 1965; Wherry &
Naylor, 1966; Williams, Harlow, Lindem, & Gab, 1970). Although a few of
these studies have compared the methods, their relative utility remains to 
be demonstrated. 

In summary, it is apparent that the linear model is a powerful device 
for predicting repeated quantitative judgments made on the basis of specific 
cues. It is capable of highlighting individual differences and misuse of 
information as well as making explicit the causes of underlying disagreements 
among judges in both simple and complex tasks. Thus, it would appear to have 
tremendous potential for providing insight into expert judgment in many 
areas.

Nonlinear Cue Utilization 

Despite the strong predictive ability of the linear model, a lively 
interest has been maintained in what Goldberg (1968) has referred to as "the 
search for configural judges." The impetus for this search comes from Meehl’s 
(1954) classic inquiry into the relative validity of clinical versus actuarial 
prediction. Meehl proposed that one possible advantage of the clinical 
approach might arise from the clinician’s ability to make use of configural 
relationships between predictors and a criterion. 

A clue to one outcome of the search was provided by Yntema and Torgerson 
(1961) who hypothesized that, whenever predictor variables are monotonically
related to a criterion variable, a simple linear combination of main effects 
will do a remarkably good job of predicting, even if interactions are known 
to exist. Yntema and Torgerson demonstrated their contention by presenting 
an example in which they showed that 94% of the variance of a truly configural 
function could be predicted from an additive combination of main effects. 
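
The arithmetic behind this "main effect approximation" can be illustrated
with a small sketch. The function used here (the product of two cues, each
taking the values 1 to 5) is a hypothetical example chosen for simplicity; it
is not the function Yntema and Torgerson analyzed, so the proportion obtained
is not expected to match their 94% figure.

```python
import numpy as np

# Hypothetical purely configural criterion: y = x1 * x2 on a 5 x 5 grid.
levels = np.arange(1, 6)
y = np.outer(levels, levels).astype(float)

# Additive (main effects only) approximation: grand mean + row + column effects.
grand = y.mean()
row_effects = y.mean(axis=1) - grand
col_effects = y.mean(axis=0) - grand
additive = grand + row_effects[:, None] + col_effects[None, :]

# Proportion of the criterion variance reproduced by the additive model.
ss_total = ((y - grand) ** 2).sum()
ss_additive = ((additive - grand) ** 2).sum()
proportion_additive = ss_additive / ss_total   # 0.9 for this grid
```

Even though the criterion is a pure interaction of the two cues, both cues
are monotonically related to it, and the additive model recovers 90% of its
variance in this example.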

Early work by Hoffman, some reported in Hoffman (1960) and some unpublished,
indicated that configural terms based on the judge's verbalizations
added little or no increment of predictable response variance to that
contributed by the linear model. The r_s values were approximately as great
as the retest reliabilities, casting additional doubt on the existence of
meaningful nonlinearities. Hursch, Hammond, and Hursch (1964); Hammond,
Hursch, and Todd (1964); and Newton (1965) reported unsuccessful attempts to
find evidence of configurality using the C coefficient, although the ambiguous
nature of low C values does not preclude the possibility of configural judgment
processes (lack of nonlinearity in the environment, or a difference between the
nonlinearity in the environmental and judgmental systems, is sufficient to
ensure low C values).

In light of the simple but compelling arithmetic underlying the "main 
effect approximation" one would expect that the results of this early research 
should not have been too surprising. Yet the search continued, buoyed by 
(a) the repeated assertions of human judges to the effect that their processes 
really were complex and configural; (b) the possibility that previous 
experimenters had not yet studied the right kinds of tasks, that is, tasks that
were "truly configural"; and (c) the possibility that the experimental designs
and statistical procedures used in previous studies were not optimally suited 
for uncovering the existing configural effects. 

For example, Wiggins and Hoffman (1968) used a more sophisticated approach
in their study of the diagnosis of neuroticism vs. psychoticism from the
MMPI. Their data, 861 MMPI profiles from 7 hospitals and clinics, were selected
because MMPI lore considered them to be highly configural with respect to
this type of diagnosis. In addition to criterion diagnoses, the judgments
of 29 clinical psychologists were available for each profile. Besides using
the linear model, Wiggins and Hoffman employed a "quadratic model" which
included the 11 MMPI scale scores (X_i) as in the linear model along with all
11 squared values of these scales (X_i^2) and the 55 cross-product terms
(X_i X_j). The third model tested was a "sign model" which included 70
diagnostic signs from the MMPI literature, many of which were nonlinear. The
coefficients for each model and each judge were derived using a stepwise
regression procedure. Cross-validation of the models in a new sample indicated
that 13 subjects were best described by the sign model, 3 by the quadratic
model, and 12 by the linear model. But even for the most nonlinear
judge the superiority of his best model over the linear model was slight
(.04 increase in r_s).
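
The predictor set of such a quadratic model can be sketched as below. The
function name and the example profile are hypothetical; the sketch only shows
how 11 scale scores expand into 11 linear, 11 squared, and 55 cross-product
terms, and it does not reproduce the stepwise procedure used in the study.

```python
from itertools import combinations

import numpy as np

def quadratic_terms(profile):
    """Return linear, squared, and pairwise cross-product terms for one profile."""
    x = np.asarray(profile, dtype=float)
    squares = x ** 2
    # One cross-product X_i * X_j for every unordered pair i < j.
    cross = np.array([x[i] * x[j] for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([x, squares, cross])

profile = np.arange(1, 12)        # hypothetical scores on 11 scales
terms = quadratic_terms(profile)
n_terms = terms.size              # 11 + 11 + 55 = 77 candidate predictors
```

With 77 candidate predictors per judge, cross-validation in a new sample is
essential, since a model of this size can capitalize heavily on chance in the
derivation sample.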

Summers and Stewart (1968) applied a similar tactic to search for
nonlinearity in a new domain. They had undergraduate subjects predict the
long-range effects of various foreign policies on the basis of four cues.
Application of a linear and quadratic model in derivation and cross-validation
samples produced results quite congruent with those of Wiggins and Hoffman.
Studies in which judges rated the attractiveness of gambles (Slovic &
Lichtenstein, 1968), evaluated the quality of patient care in hospital wards
(Huber, Sahney, & Ford, 1969), and made decisions about workmen's compensation
cases in a court of law (Kort, 1968) also found only minimal improvements in
predictability as a consequence of including configural and curvilinear terms.



In an attempt to demonstrate the existence of configural effects, a 
number of investigators dropped the regression approach in favor of ANOVA 
designs applied to systematically constructed stimuli in tasks ranging from 
medical diagnosis to stock market forecasting (Hoffman, Slovic, & Rorer,
1968; Rorer, Hoffman, Dickman, & Slovic, 1967; Slovic, 1969). These studies
did succeed in uncovering numerous instances of interaction among cues but 
the increment in predictive power contributed by these configural effects 
was again found to be small. 

This line of research, employing both correlational and ANOVA techniques, 
can be summarized simply and conclusively. The hypothesis of Yntema and 
Torgerson has clearly been substantiated. The linear model accounts for
all but a small fraction of predictable variance in human judgment across a 
remarkably diverse spectrum of tasks. 

However, the ANOVA research and other recent studies aimed at assessing
the predictive power of nonlinear effects have exposed a different view of the
problem, one that accepts the limited predictive benefits of nonlinear models
but, simultaneously, asserts the definite, indeed widespread, existence of
nonlinear judgment processes, and highlights their importance with regard to
theoretical and general explanatory considerations. The philosophy behind
this view is typified by Green's argument to the effect that:

"Nonlinear relationships and interactions are to a first 
approximation linear and additive .... If the goal is 
prediction ... an adequate description will serve. But 
if the goal is to understand the process, then we must 
beware of analyses that mask complexities" (Green, 1958, p. 98). 

To illustrate the complexity inherent in judgments that are quite
predictable with a linear model, consider the data from the previously
described study of ulcer diagnosis conducted by Hoffman, Slovic, and Rorer
(1968). An ANOVA technique showed that each of the nine radiologists who
served as subjects exhibited at least two statistically significant
interactions. One showed thirteen. Across radiologists there were 24
significant two-way, 17 three-way, and 14 four-way interactions. A subset of
only 17 cue-configurations, out of a possible 57, accounted for 43 of the 57
significant interactions. Thus numerous instances of configurality were
evidenced and a subset of specific interactions occurred repeatedly across
radiologists. Hoffman et al. did not attempt to probe into the content of
the interactions they observed but Slovic (1969), in his study of
stockbrokers, and Kort (1968), in his study of workmen's compensation
decisions, did, and both uncovered information that provided worthwhile
insights into the rationale behind their judges' nonlinear use of the cues.

Anderson has also paid careful attention to interactions obtained in his
ANOVA studies of impression formation and has found several of substantive
interest. For example, Anderson and Jacobson (1965) had subjects judge the
likableness of persons described by sets of three adjectives. They found an
interaction which implied that the weight given a particular adjective was
less for sets where that adjective was inconsistent with the implications of
the other adjectives than for sets in which it was consistent. Sidowski and
Anderson (1967) asked subjects to judge the attractiveness of working at a
certain job in a certain city. The judgments of each city-job combination
were found to be a weighted sum of the values for the two components except
that the attractiveness of being a teacher was more dependent upon the
attractiveness of the city than were the other occupations. This may be
because teachers are in more direct contact with the cities' socio-economic
milieu. Other interesting examples of interactive cue utilization have
been found by Lampel and Anderson (1968) and Gollob (1968).

Hoffman (1968) observed that an undirected search for configural relations 
within a finite set of data is fraught with statistical difficulties. Green 
(1968) concurred and criticized standard regression and ANOVA techniques for 
being essentially fishing expeditions. A better strategy, he suggested, is
to form some specific hypothesis about configurality and seek support for it.
In this vein, Slovic (1966) hypothesized and found differences in subjects'
strategies for combining information as a function of whether cues were in
conflict. When the implications of important cues were congruent, subjects
seemed to use both. When they were inconsistent, subjects focused on one of
the cues or looked to other cues for resolution of the conflict. This study
and related studies reported in Hoffman (1968) and Anderson and Jacobson
(1965) indicate that the linear model needs to be amended to include a term
sensitive to the level of incompatibility among important cues.

Tversky (1969) and Einhorn (1970, in press) also hypothesized, and found,
specific nonlinear uses of information. Tversky showed that subjects
sometimes chose between a pair of two-dimensional gambles by a lexicographic
procedure in which they selected the gamble with the greater probability of
winning, provided that the difference between gambles on this dimension
exceeded some small value, ε. If the difference was less than or equal to ε,
these subjects selected the gamble with the higher payoff. In contrast
to the linear model, this sort of strategy is noncompensatory inasmuch
as no amount of superiority with regard to payoff can overcome a deficiency
greater than ε on the probability dimension.
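
The lexicographic rule just described can be written out as a short sketch.
The `Gamble` type, the function name, the tie-breaking convention, and the
numerical values are hypothetical illustrations, not part of Tversky's
materials.

```python
from typing import NamedTuple

class Gamble(NamedTuple):
    p_win: float    # probability of winning
    payoff: float   # amount to be won

def lexicographic_choice(a, b, epsilon):
    """Choose by probability unless the difference is within epsilon;
    otherwise choose by payoff (ties on payoff go to the first gamble)."""
    diff = a.p_win - b.p_win
    if abs(diff) > epsilon:
        return a if diff > 0 else b
    return a if a.payoff >= b.payoff else b

# Probabilities differ by less than epsilon, so payoff decides.
close_a = Gamble(p_win=0.50, payoff=4.0)
close_b = Gamble(p_win=0.45, payoff=5.0)
choice_close = lexicographic_choice(close_a, close_b, epsilon=0.1)

# Probabilities differ by more than epsilon, so payoff is ignored entirely.
far_a = Gamble(p_win=0.80, payoff=1.0)
far_b = Gamble(p_win=0.40, payoff=100.0)
choice_far = lexicographic_choice(far_a, far_b, epsilon=0.1)
```

The second comparison shows the noncompensatory character of the rule: the
much larger payoff of `far_b` never enters the decision once the probability
difference exceeds epsilon.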

Einhorn made a valuable contribution to our understanding of nonlinear 
judgment by developing mathematical functions that could be incorporated into 
a prediction equation to approximate conjunctive and disjunctive processes 
as postulated by Coombs (1964) and Dawes (1964). Dawes described the
evaluation of a potential inductee by a draft board physician as one example 
of a conjunctive process. The physician requires that the inductee meet an 
entire set of minimal criteria in order to be judged physically fit. A 
disjunctive evaluation is a function of the maximum value of the stimulus 
or one of its attributes. For example, a scout for a pro football team may
evaluate a player purely in terms of his best specialty, be it passing, 
running, or kicking. Neither the conjunctive nor the disjunctive models 
weighs one attribute against another as does the linear model. Einhorn pitted 
his conjunctive and disjunctive models against the linear model in two tasks:
one where faculty members ranked applicants for graduate school, the other 
where students ranked jobs according to their preferences. Using a cross- 
validation sample of judgments, he found that many subjects were fit better by 
the nonlinear, noncompensatory models than by the linear model. The conjunc- 
tive model proved superior to the disjunctive model, especially as the number 
of cues increased. Einhorn concluded by criticizing the notion that cognitive 
complexity and mathematical complexity go hand in hand. He argued that non- 
linear, noncompensatory strategies may be more simple, cognitively, than the 
linear model, despite their greater mathematical complexity. 
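
The contrast among the three kinds of rule can be sketched in idealized form.
The functions below are assumptions made for illustration: an all-cutoffs
test for the conjunctive rule, the maximum attribute for the disjunctive
rule, and a weighted sum for the linear rule. They are not the mathematical
approximating functions Einhorn developed, which are not reproduced in this
paper, and the attribute values are hypothetical.

```python
def conjunctive_accept(attributes, cutoffs):
    """Accept only if every attribute meets its minimal criterion."""
    return all(a >= c for a, c in zip(attributes, cutoffs))

def disjunctive_score(attributes):
    """Evaluate by the single best attribute."""
    return max(attributes)

def linear_score(attributes, weights):
    """Weighted sum: strength on one attribute can offset weakness on another."""
    return sum(w * a for w, a in zip(weights, attributes))

# Hypothetical players rated on passing, running, and kicking (0 to 10).
specialist = (9, 2, 3)
all_rounder = (6, 6, 6)
equal_weights = (1 / 3, 1 / 3, 1 / 3)
cutoffs = (5, 5, 5)

prefers_specialist_disjunctive = (
    disjunctive_score(specialist) > disjunctive_score(all_rounder)
)
prefers_all_rounder_linear = (
    linear_score(all_rounder, equal_weights) > linear_score(specialist, equal_weights)
)
specialist_passes = conjunctive_accept(specialist, cutoffs)
all_rounder_passes = conjunctive_accept(all_rounder, cutoffs)
```

In this example the disjunctive rule favors the specialist, while the linear
and conjunctive rules favor the all-rounder, which shows how the choice of
combination rule can reverse an evaluation based on the same cues.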

At this point, it seems appropriate to conclude that notions about non- 
linear processes are likely to play an increasing role in our understanding 
of judgment despite their limited ability to outpredict linear models. 

Subjective Policies and Self Insight

Thus far we have been discussing weighting policies that have been
assessed by fitting a regression model to a judge's responses. We think
of these as computed or objective policies. Judges in a number of studies,
after a policy was computed on the basis of their responses, were asked
to describe the relative weights they were using in the task. The
correspondence between these "subjective weights" and the computed weights
serves as an indicant of the judge's insight into his own policy. Martin
(1957) and Hoffman (1960) proposed the technique that has been used
to examine these "after the fact" opinions: asking the
judge to distribute 100 points according to the relative importance of each
attribute. Martin found that the linear model based on subjective weights
produced a mean r_s of .77 in predicting evaluations of a student's
sociability. A linear model computed in the usual way but not cross-validated
did better, producing a mean r_s of .89. Hoffman (1960), Oskamp (1962),
Pollack (1964), Slovic (1969), and Slovic, Fleissner, and Bauman (in press)
all found serious discrepancies between subjective and computed relative
weights for certain individual judges.

One type of error in self insight has emerged in all of these studies.
Judges universally and strongly overestimate the importance they place on
minor cues (i.e., their subjective weights greatly exceed the computed weights
for these cues) and they underestimate their reliance on a few major variables.

Subjects apparently are quite unaware of the extent to which their
judgments can be predicted by only a few cues. Across a number of studies,
varying in the number of cues that were available, three cues usually
sufficed to account for more than 80% of the predictable variance in the
judge's responses. The most important cue usually accounted for more than 40%
of this variance.

Shepard (1964, p. 266) presented an interesting explanation of the subjective 
underweighting of important cues and overweighting of minor cues. He hypothe- 
sized: 

"Possibly our feeling that we can take into account a host
of different factors comes about because, although we remember
that at some time or other we have attended to each of
the different factors, we fail to notice that it is seldom
more than one or two that we consider at any one time."

Slovic, Fleissner, and Bauman (in press), studying the policies of
stockbrokers, examined the relationship between number of years as a broker
and accuracy of self-insight. The latter was measured by correlating a
broker's subjective weights with his computed weights across eight cue
factors. Across 13 brokers, the Spearman rank correlation between the
insight index and experience was -.43. Why should greater experience lead
to less valid self-insight? Perhaps the recent training of the young
brokers necessitated an explicit awareness of the mechanics of the skill
that they were attempting to learn. Skills generally demand a great deal of
conscious attention as they are being acquired. With increasing experience,
behaviors become more automatic and require much less attention. Because
of this they may also be harder to describe. This question is an intriguing
one and needs to be investigated with more precision than was done in this
study. It may be that the most experienced judges produce verbal rationales
for their evaluations that are less trustworthy than those of their
inexperienced colleagues!
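
A minimal sketch of how a correspondence index between subjective and
computed weights might be calculated is given below. The use of a rank
correlation for this index, the eight weights, and the function names are
illustrative assumptions; the source specifies only that subjective and
computed weights were correlated across eight cue factors. The ranking
helper assumes there are no tied values.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n, smallest value first (assumes no tied values)."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Hypothetical judge: 100 points distributed over eight cues ...
subjective = np.array([22, 18, 15, 13, 11, 9, 7, 5])
# ... and relative weights computed from a regression on his judgments.
computed = np.array([0.45, 0.25, 0.05, 0.10, 0.02, 0.06, 0.04, 0.03])

insight = spearman(subjective, computed)   # about .76 for these values
```

Note that the hypothetical judge spreads his subjective points far more
evenly than the computed weights warrant, which is the pattern of error
described above, yet the rank-order index is still fairly high; an index of
this kind is sensitive to ordering, not to the degree of over- or
underweighting.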

Task Determinants in Correlational and ANOVA Research 
Cue Interrelationships 

Several studies have examined the role of intercorrelational structure
and conflict among cues in determining the weighting of those cues. Slovic
(1966; see also Hoffman, 1968) found that, when important cues agreed,
subjects used both and weighted them equally. When they disagreed, subjects
focused on one of the cues or looked to other cues for resolution of the
conflict. Also, in situations of higher cue conflict, r_s was considerably
lower. These effects were found both when cue-conflict and
cue-intercorrelations varied together and when conflict was varied holding
intercorrelations constant. Dudycha and Naylor (1966) have also studied the
effect of varying the intercorrelations among cues upon policy equations.
They found that profiles of r_{i,s} values showed less scatter and r_s
decreased as the correlation between cues decreased.

Cue Variability and Cue Utilization

Uhl and Hoffman (1958) hypothesized that an increase in the variability 
of a salient cue, across a set of stimuli, will lead to greater weighting of 
this cue by a judge. This increased weight will persist, they proposed, even
among subsets of stimuli for which this cue is not unusually variable. 

Underlying this hypothesis was the assumption that the judge is motivated to 
differentiate among stimuli along the criterion dimension and that cues 
which increase his ability to differentiate will reinforce and increase his 
use of those attributes. The variability of a salient cue is one such 
feature that correlates with differentiability. Presumably judges will 
focus their attention on the more highly-variable cues, other things being 
equal. 

Uhl and Hoffman tested this hypothesis in a task where subjects judged 
IQ on the basis of profiles made up of nine cues. Each subject judged several 
sets of profiles on different days. The variability of a particular cue 
was increased on one day by providing a greater number of extreme levels. On 
a following day, the cue was returned to normal and its relative weight was 
compared with the weight it received prior to the manipulation of variability. 

For one group of subjects a highly valid cue was manipulated in this way. 

For a second group, the variability of a minor cue was altered. The 
hypothesized effect was found in seven out of ten subjects when a strong 
cue was manipulated. Increasing the variability of the minor cue had no 
effect upon its subsequent use. The authors came to the tentative conclusion 
that the judge may alter his system of judgment because of the characteristics
of samples he judges. 

Morrison and Slovic (1962) independently tested a version of the variability 
hypothesis in a different type of setting. Each of their stimuli consisted 
of a circle (dimension 1) paired with a square (dimension 2). Subjects had
to rank order a set of these stimuli on the basis of their total area (circle
and square combined). The results indicated that if the variability of
circle area was greater than the variability of square area in the set of
stimuli, then circle area would be assigned much heavier weight in the
judgment. If the variability of square area was higher in the set, then
square area became the dominant dimension.

Cue Format 

Knox and Hoffman (1962) examined the effect of profile format on judg- 
ments of a person's intelligence and sociability. Each subject based his 
judgments on profiles of cues. In one condition, the scores were presented
as T scores with a mean of 50 and standard deviation of 10. A second condition
presented the information in percentile scores. The latter profiles showed
considerably more scatter and had more extreme scores than the former, which
tended to appear rather "squashed." Judgments were made on a stanine
scale with a normal distribution suggested but not forced. Judgments made to
percentile scores were found to be much more variable. It appeared that
judges were responding not only to the underlying meaning of the scores but
to the graphical position of the points on the profile in an absolute sense.
Subjects were reluctant to make ratings on the judgment scale that were more
extreme than the stimulus scores. Being statistically naive, they were
unable to gauge the true extremeness of certain T scores. Judgments made to
percentile scores were also more reliable and produced higher values of r_s
when linear models were fitted to them. Beta weights did not differ
significantly between formats.
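
The difference in apparent extremeness between the two formats can be made
concrete with a short sketch. It assumes, for illustration only, that the
underlying scores are normally distributed, so that a T score (mean 50,
standard deviation 10) converts to a percentile through the normal
distribution function; the function name is hypothetical.

```python
import math

def t_score_to_percentile(t_score):
    """Percentile rank of a T score, assuming a normal distribution."""
    z = (t_score - 50.0) / 10.0
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The same three profiles points expressed in each format.
t_scores = [50, 60, 70]
percentiles = [t_score_to_percentile(t) for t in t_scores]
```

Under this assumption a T score of 70 lies only 20 points above the mean on
the T scale but falls near the 98th percentile, so a profile that looks
"squashed" in T-score form can contain scores that are in fact quite extreme.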

Number of Cues 




There has been surprisingly little correlational research done on the
effects of varying the number of cues upon a judge's performance. A pilot
study by Hoffman and Blanchard (1961) suggested some interesting effects
but was limited by a small number of subjects. They had subjects predict
a person's weight on the basis of physical characteristics. They judged
the same stimulus at different times, seeing either 2, 5, or 7 cues. Increased
numbers of cues led to lower r_s values, decreased accuracy, lower test-retest
reliability, and lower response variance. This latter finding may be the
cause of some of the other results and may itself be due to an increased
number of conflicts among cues in the larger data sets.

Hayes (1964) and Einhorn (in press) also found that response consistency
decreased as the number of cues increased. Einhorn interpreted this
decrease in r_s as indicating that his subjects were using more complex models
in the high information conditions, models whose variance was not predictable
from the linear and nonlinear models he tested. However, the reliability of
his subjects' judgments was not assessed and it is possible that greater
information merely produced more unreliable rather than more complex judgments.
Hayes found that increased numbers of cues also led to a reduction in decision
quality when subjects were working under a time limit.

Oskamp (1965) had 32 judges, including eight experienced clinical
psychologists, read background information about a case. The information was
divided into four sections. After reading each section of the case the
judge answered 25 questions about the attitudes and behaviors of the subject
and gave a confidence rating with each answer. The correct answers were
known to the investigator. Oskamp found that, as the amount of information
about the case increased, accuracy remained at about the same level while
confidence increased dramatically and became entirely out of proportion to
actual correctness.

In summary, there is evidence that increasing the amount of information
available to the decision maker increases his confidence without increasing
the quality of his decisions, and makes his decisions more difficult to
predict. It is obvious, however, that more work is needed in this area.

Cue-Response Compatibility

Fitts and Deininger (1954) introduced the concept of stimulus-response
compatibility to explain the results of several paired-associates learning
and reaction time experiments. Compatibility was defined as a function of
the similarity between the spatial position of the stimulus in a circular
array and the position of the correct response in the same sort of array.
High compatibility produced the quickest learning and fastest reaction time.
In a more recent experiment concerned with risk-taking judgments, Slovic and
Lichtenstein (1968) observed a related type of compatibility effect that
influenced cue utilization. They found that when subjects rated the
attractiveness of a gamble, probability of winning was the most important
factor in their policy equations. In a second condition, subjects were
required to indicate the attractiveness of a gamble by an alternative
method, namely, equating the gamble with an amount of money such that
they would be indifferent between playing the gamble and receiving the stated
amount for certain. Here it was found that attractiveness was determined
more by a gamble's outcomes than by its probabilities. The outcomes, being
expressed in units of dollars, were readily commensurable with the units of
the responses, which were also dollars. On the other hand, the probability
cues had to be transformed by the subject into values commensurable with
dollars before they could be integrated with these other cues. It seems
plausible that the cognitive effort involved in making this sort of
transformation greatly detracted from the influence of the probability cues
in the second task.

This finding suggests a general hypothesis to the effect that greater
compatibility between a cue and the required response should enhance the
importance of that cue in determining the response. Presumably, the more
complex the transformation needed to make a cue commensurable with other
important cues and with the response, the less that cue will be used.

Focal Topic of Functional Measurement: 

Models of Impression Formation 
There is a substantial body of literature concerned with the problem
of understanding how component items of information are integrated into 
impressions of people. Most of this research can be traced to the work of 
Asch (1946) who asked subjects to evaluate a person described by various trait 
adjectives. In one of Asch's studies, the adjective "warm" was added to
the set of traits. Another group saw the trait "cold." All other adjectives 
were identical. The subjects wrote a brief description of the person and 
completed an adjective checklist. Asch found that substitution of the word 
"warm" for "cold" produced a decided change in the overall characterization 
of the person being evaluated. He interpreted this as being due to a shift 
in meaning of the traits associated with the key adjectives "warm" and "cold." 
This view has much in common with the notions of configurality and
interaction we have been discussing.

More recent endeavors have centered around the search for quantitative
models of the integration process. That is, they attempt to develop a
mathematical function of the scale values of the individual items to predict
the overall impression. Although Asch explicitly denied that an impression
could be derived from a simple additive combination of stimulus items, the
additive model and its variations have received the most attention (Rosenberg,
1968). Most of these studies have used rigorous experimental control and
statistical techniques such as ANOVA within the conceptual framework of
functional measurement.

One of the first studies to test the additive model empirically was done 
by Anderson (1962). His subjects rated a number of hypothetical persons, each
described by three adjectives, on a 20-point scale of likableness. Within
each set there was one item each of high, medium, and low scale value as
determined in a separate normative study. An additive model gave an excellent
fit to the data. 

The additive model serves as a more general case for two derivative
models: one based on the principle of summation of information, the other
an averaging formulation. In the summation model, the values of the stimulus
items are added to arrive at an impression. The averaging model asserts that
an impression is the mean, rather than the sum, of the separate item values.
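
The difference between the two formulations can be illustrated with a minimal sketch. The scale values below are hypothetical and chosen only for illustration; the function names are likewise assumptions, not part of any cited study. Adding mildly favorable items to highly favorable ones raises the sum but lowers the mean, so the two models make opposite predictions for such sets.

```python
def summation(values):
    """Summation model: the impression is the sum of the item scale values."""
    return sum(values)


def averaging(values):
    """Averaging model: the impression is the mean of the item scale values."""
    return sum(values) / len(values)


# Hypothetical scale values (positive = favorable).
two_extreme = [3.0, 3.0]
mixed = [3.0, 3.0, 1.0, 1.0]  # the same two extreme items plus two moderate ones

sum_extreme, sum_mixed = summation(two_extreme), summation(mixed)
avg_extreme, avg_mixed = averaging(two_extreme), averaging(mixed)

print("summation:", sum_extreme, "->", sum_mixed)
print("averaging:", avg_extreme, "->", avg_mixed)
```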

Anderson's (1962) study did not attempt to distinguish between the 
averaging and summation formulations. To do so requires careful attention 
to subtle facets of stimulus construction and experimental design. Most of 
the recent research has taken on this challenge, varying task and design
characteristics in an attempt to determine the validity of these and other
competing models. The following section will review briefly the types of
situational manipulations that have been brought to bear on this problem of
modeling.



Task Determinants of Information Use 
in Impression Formation 



Set Size 

The number of items of information in a set is one factor that has 
been varied in attempts to distinguish summation and averaging models. 
Fishbein and Hunter (1964) provided four groups of subjects with different 
amounts of positively evaluated information about a fictitious person. The 
information was presented sequentially in such a way that the total summation
of affect increased as a function of the number of items while the mean
decreased, i.e., the more highly evaluated items came first. The subjects
used a series of bi-polar adjective scales to evaluate the stimulus persons.
The judgments became more favorable as the amount of information increased,
thus supporting the summation model. The Fishbein and Hunter study has
been criticized by Rosenberg (1968) who argued that presenting the most 
favorable adjectives first permits possible sequential effects to influence 
the results. 

Anderson (1965a) also used set size to contrast the two models. He had 
subjects rate the likableness of persons described by either two or four 
traits. He found that sets consisting of two moderately-valued traits and
two extremely-valued traits produced a less extreme judgment than sets 
consisting of the two extreme traits alone. This result was taken as support 
for the averaging model. Another result of this study, that sets of four 
extreme adjectives were rated more extreme than sets with two extreme 
adjectives, confirmed earlier findings by Anderson (1959), Podell (1962) and 
Stewart (1965) to the effect that increased set size produces more extreme 
ratings. At first glance this seems to support a summation model but Anderson 
showed how it could be accommodated using an averaging model that incorporates 
an initial impression with non-zero weight and scale value $s_0$. The final
impression is assumed to be an average of this initial impression and the
scale values of the traits. Algebraically,

\[ J_n = \frac{nws + (1-w)s_0}{nw + (1-w)} \]

where $J_n$ is the judgment based on n adjectives each of scale value s and
weight w. The term $s_0$ represents the initial or neutral impression. Its
relative importance is $(1-w)/[nw + (1-w)]$, a value that decreases as more
information accumulates. Anderson (1967b) provided further support for
this model.
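
A minimal numerical sketch of this averaging formulation shows how it produces a set-size effect. The parameter values (s = 3, w = 0.5, and a neutral initial impression of 0) are hypothetical, as are the function names.

```python
def judgment(n, s, w, s0):
    """Averaging model with an initial impression:
    J_n = (n*w*s + (1 - w)*s0) / (n*w + (1 - w))."""
    return (n * w * s + (1 - w) * s0) / (n * w + (1 - w))


def initial_impression_weight(n, w):
    """Relative importance of the initial impression: (1 - w) / (n*w + (1 - w))."""
    return (1 - w) / (n * w + (1 - w))


# Hypothetical values: equally extreme traits (s = 3), neutral initial impression.
S, W, S0 = 3.0, 0.5, 0.0
j_two = judgment(2, S, W, S0)
j_four = judgment(4, S, W, S0)

for n in (1, 2, 4, 8):
    print(n, round(judgment(n, S, W, S0), 3), round(initial_impression_weight(n, W), 3))
```

Under these assumptions the judgment moves toward s as n grows, even though every trait has the same scale value. An averaging rule therefore reproduces the finding that larger sets of extreme adjectives are rated more extremely.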

Extremity of Information 

The adjectives in Anderson's (1965a) study were presumed to be of equal
weight. Thus the averaging model predicted that the judgment of a stimulus 
set containing four items having extreme scale values averaged with the 
judgment of a set containing four items of moderate value would equal the 
judgment of a set containing two extreme and two moderate items. Anderson 
found that this prediction did not hold for negatively-evaluated items. 

The discrepancy suggested that the extreme negative items carried more 
weight than did moderately-negative information. 

Studies by Kerrick (1958), Manis, Gleason, and Dawes (1966), Osgood
and Tannenbaum (1955), Podell and Podell (1963), Weiss (1963), and Willis
(1960) also found indications that the weight of an information item is
associated with the extremity of its scale value. Manis et al. found that
two positive or two negative items of information of different value would
lead to a judgment less extreme than the most extreme item but more extreme 
than that predicted by a simple averaging of the items. At the same time 
these judgments were not extreme enough to be produced by the summation model. 
To account for these results, the authors suggested a version of the averaging 
model that weights items in proportion to their extremity. 
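
One possible reading of such an extremity-weighted average is sketched below. Taking each item's weight to be its absolute distance from a neutral point is an assumption made here for illustration, as are the scale values; the text does not give the exact weighting function used by Manis et al.

```python
def extremity_weighted_average(values, neutral=0.0):
    """Average in which each item's weight is its distance from the neutral
    point (one hypothetical way to weight items by extremity)."""
    weights = [abs(v - neutral) for v in values]
    total = sum(weights)
    if total == 0:
        return neutral  # every item sits exactly at the neutral point
    return sum(w * v for w, v in zip(weights, values)) / total


items = [1.0, 3.0]  # two positive items of different value (hypothetical)
simple_mean = sum(items) / len(items)
weighted = extremity_weighted_average(items)
most_extreme = max(items)
total_sum = sum(items)

print(simple_mean, weighted, most_extreme, total_sum)
```

With these values the weighted judgment falls above the simple mean but below both the most extreme item and the sum, which is the ordering described above.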

Redundancy 

Both summation and averaging models assume that the values of the stimulus 
items are independent of the other items in the set. This assumption has 
been the focus of concern for a number of studies. Dustin and Baldwin (1966),
for example, had subjects evaluate persons described by single adjectives, A
and B, and by the combined pair AB. Ratings of AB pairs tended to be more
extreme than the mean of the individual items; this tendency was dependent
upon the degree of redundancy or implication between A and B as measured by
their intercorrelation in a normative sample. Schmidt (1969) did a similar
study but varied the relatedness of the items differently. He combined
trait sentences (Mr. A is kind) with instance sentences (Mr. A is kind to B).
The two sentences just given are obviously highly redundant. By changing the
trait adjectives this redundancy can be greatly reduced. Schmidt found that
judgments based on less redundant sets were consistently more extreme than
those based on more redundant information. Wyer reported similar findings
in studies where redundancy was measured by the conditional probability of A
given B (Wyer, 1968) and by the degree to which the joint probability of
occurrence ($P_{AB}$) exceeded the product of the two unconditional probabilities
($P_A$ and $P_B$) for each adjective (Wyer, 1970). It seems apparent that models in
this area will need further revision to handle the effects of redundancy.

Inter-Item Consistency

The data just described indicate that highly-redundant information has 
less impact. But information with too great a "surprise value" shares a 
similar fate. Anderson and Jacobson (1965) found that an item whose scale 
value is highly inconsistent with its accompanying items (as is the trait
"gloomy" in the set "honest-considerate-gloomy") was likely to be discounted,
i.e., given less weight. The discounting was only slight when subjects
were told that all three traits were accurate and equally important, but
increased when subjects were cautioned that one of the items might not be
as valid as the others. Anderson and Jacobson argued that the averaging
model might have to include differential weights to accommodate the reduced
impact of inconsistent information. However, they also pointed out that
their results might be caused by an order effect of the sort that gives more
weight to information earlier in the sequence.

Wyer (1970) defined inconsistency between two adjectives as the degree
to which their joint probability ($P_{AB}$) was less than the product of their
unconditional probabilities ($P_A$ and $P_B$). Note that this places high
inconsistency at the negative end of a continuum defined by $P_{AB} - P_A P_B$, with
maximum redundancy at the positive end. After constructing stimuli according
to this definition, Wyer found that inconsistency produced a discounting of
the less polarized of a pair of adjectives, leading to a more extreme
evaluation. However, when inconsistency became too great, both adjectives
appeared to be discounted, producing a less extreme evaluation.
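
The continuum on which redundancy and inconsistency are placed can be made concrete with a small sketch. The probabilities are hypothetical, and the function name and labels are assumptions introduced only for illustration.

```python
def association_index(p_ab, p_a, p_b):
    """Joint probability minus the product of the unconditional probabilities.
    Positive values indicate redundancy, negative values inconsistency,
    and zero indicates statistical independence."""
    return p_ab - p_a * p_b


P_A, P_B = 0.5, 0.5  # hypothetical unconditional probabilities of two traits

redundant = association_index(0.45, P_A, P_B)     # traits usually co-occur
independent = association_index(0.25, P_A, P_B)   # joint equals the product
inconsistent = association_index(0.05, P_A, P_B)  # traits rarely co-occur

print(round(redundant, 2), round(independent, 2), round(inconsistent, 2))
```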

Himmelfarb and Senn (1969) studied the effects of stimulus inconsistency 
in experiments concerned with judgments of a person’s social class. The 
stimulus persons were described by dimensional attributes — occupation, 
income, and education. Surprisingly, discounting of inconsistent information 
was not found. The authors speculated that their failure to find discounting 
here might have been due to the lack of directly contradictory information or 
to the possibility that social class stimuli, being objective aspects of an 
individual, might be less easily discounted than subjective personality 
traits.

Other Contextual Effects 

Anderson and Lampel (1965) and Anderson (1966) had subjects form an 
impression based on three adjectives and then rate the likability of one of 
the component traits alone. Both studies produced context effects,
judgments of the single trait being displaced towards the values of the other
traits. Anderson (1966) noted that this type of deviation from an additive 
model does not show up as an interaction effect. Thus, lack of interaction 
does not unequivocally support the additive model. Anderson suggested that
the most natural interpretation of this effect is that the value or meaning 
of the test word has changed as a function of the impression formation 
process, much as Asch originally suggested. Wyer and Watson (1969) found 
evidence to support this change of meaning interpretation over several 
competing hypotheses and Chalmers (1969) argued that change of meaning 
could readily be accommodated by Anderson’s weighted average model. 

Concluding Comments 

The additive model dominates the area of impression formation much as 
the linear model dominates correlational research. Like linearity, additivity 
is not a completely satisfactory concept, however, and there are many subtle 
factors competing with one another to determine the deviations in the data. 

The contradictory findings in this area are difficult to evaluate due to 
the considerable variation in types of stimuli, response modes, and 
instructions across studies. Manis, Gleason, and Dawes (1966, p. 418)
seem to summarize the present state of affairs most aptly with their
comment:

"It seems clear that our main need at the present
time is for more research concerning those variables
(e.g., topic, situation, subject characteristics)
which determine the combinatorial model that is applied
in the evaluation of complex social stimuli; available
evidence suggests that there is no single model which
can be universally applied."

Focal Topic of Bayesian Research: 

Conservatism 

The most common Bayesian study deals with probability estimation, often
in some variant of the bookbag and poker chip experiment described earlier.
The primary finding has been labeled conservatism: upon receipt of new
information, subjects revise their posterior probability estimates in the
same direction as the optimal model, but the revision is typically too
small; subjects act as if the data are less diagnostic than they truly
are. Subjects in some studies (Peterson, Schneider, & Miller, 1965; Phillips
& Edwards, 1966) have been found to require from two to nine data observations
to revise their opinions as much as Bayes' theorem would change after one
observation.
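
The size of the optimal revision, and what it means to fall short of it, can be sketched as follows. The bag compositions (70% versus 30% red chips) are hypothetical. Modeling a conservative subject by raising each likelihood ratio to a power less than one is an illustrative assumption, not a model proposed in the studies cited.

```python
def posterior_odds(prior_odds, likelihood_ratios, exponent=1.0):
    """Multiply prior odds by each datum's likelihood ratio (Bayes' theorem
    in odds form). An exponent below 1 is a hypothetical way to mimic a
    subject who treats each datum as less diagnostic than it is."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr ** exponent
    return odds


def odds_to_probability(odds):
    return odds / (1 + odds)


LR_RED = 0.7 / 0.3  # likelihood ratio of one red chip for the two assumed bags

bayes_after_one = posterior_odds(1.0, [LR_RED])
conservative_after_one = posterior_odds(1.0, [LR_RED], exponent=0.5)
conservative_after_two = posterior_odds(1.0, [LR_RED, LR_RED], exponent=0.5)

print(round(odds_to_probability(bayes_after_one), 3))
print(round(odds_to_probability(conservative_after_one), 3))
print(round(odds_to_probability(conservative_after_two), 3))
```

With an exponent of 0.5, the hypothetical subject needs two red chips to reach the opinion that Bayes' theorem reaches after one. This is the low end of the two-to-nine range reported above.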

Much of the Bayesian research has been motivated by a desire to better 
understand the determinants of conservatism in order that its effects might 
be minimized in practical diagnostic settings. A spirited debate has been 
raging among Bayesians about which part of the judgment process leads 
subjects astray. The principal competing explanations as to the "locus of
conservatism" are the misperception, misaggregation, and artifact hypotheses 
(Edwards, 1968). 

Misperception 

Perhaps subjects don’t understand the data generator underlying, or 
producing, the probabilistic data. Lichtenstein and Feeney (1968) showed 
that subjects performed very poorly when dealing with a circular normal data 
generator despite 150 training trials with feedback. But subjects’ data and 
comments suggested an entirely different (and incorrect) model regarding the 
meaning of each datum, and reanalyses of their responses showed them to be 
quite consistent with this simpler yet incorrect view of the data generator. 
Does such a simple and popular data generator as the binomial distribution 
also lead to misperceptions about the meaning of data? Vlek and Bientema 
(1967) and Vlek and van der Heijden (1967) showed that it does. Vlek and 
Bientema presented subjects with samples (e.g., 5 black and 4 white) drawn 
from an urn whose constituent proportions were known to the subject, and 
asked them how often such a sample might be expected to occur in 100,000
samples of the same size. Vlek and van der Heijden asked for the probability
that such a sample would occur in 100 trials. Both studies showed that subjects 
had poor understanding of the likelihood of data. 

If such misperceptions are the cause of conservatism, then one would 
expect estimates of posterior probabilities to be consistent with, and 
predictable from, estimates about the data generator. Beach (1966) showed 
that subjects do exhibit this consistency even when poorly trained in the 
characteristics of the task. He concluded that ". . . even though subjects'
subjective probabilities were inaccurate, they were still the bases for
decisions . . ." (p. 35). Peterson, Ulehla, Miller, Bourne, and Stilson
(1965) asked subjects in a binomial dice task "what is the probability that
Die W is generating the data?" and "what is the probability that the next roll
will be White?" Their answers were conservative but consistent. Peterson,
DuCharme and Edwards (1968) had subjects estimate P(H/D), then P(D/H). Then 
they were instructed in P(D/H) by being shown several theoretical sampling 
distributions from an urn and discussing them with the experimenter. For 
example, they observed how the distribution became more peaked as the number 
of draws and the dominant proportion increased. Finally they were again asked 
to estimate P(H/D). Peterson et al. found that the use of subjects' estimates
of P(D/H) to predict estimates of P(H/D) accounted for all the conservatism of
the latter responses. They also found that instruction about the sampling
distributions reduced conservatism in the final stage, but the reduction was
small in relation to the amount of conservatism. 
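
The logic of the misperception hypothesis can be sketched numerically: if a subject's subjective sampling distributions are too flat, posterior probabilities computed consistently from them will be conservative. The urn proportions and sample below are hypothetical. Flattening by mixing the true binomial likelihood with a uniform distribution is an assumption made only for illustration.

```python
from math import comb


def binomial_likelihood(reds, n, p_red):
    """P(D/H): probability of `reds` red chips in n draws when P(red) = p_red."""
    return comb(n, reds) * p_red**reds * (1 - p_red) ** (n - reds)


def flattened(likelihood, n, flatness):
    """Hypothetical subjective likelihood: the true value mixed with a
    uniform distribution over the n + 1 possible sample compositions."""
    return (1 - flatness) * likelihood + flatness / (n + 1)


def posterior_h1(like_h1, like_h2, prior_h1=0.5):
    """P(H1/D) from Bayes' theorem for two hypotheses."""
    joint_h1 = prior_h1 * like_h1
    joint_h2 = (1 - prior_h1) * like_h2
    return joint_h1 / (joint_h1 + joint_h2)


N, REDS = 9, 6  # hypothetical sample: 6 red chips in 9 draws
true_h1 = binomial_likelihood(REDS, N, 0.7)
true_h2 = binomial_likelihood(REDS, N, 0.3)

optimal = posterior_h1(true_h1, true_h2)
subjective = posterior_h1(flattened(true_h1, N, 0.5), flattened(true_h2, N, 0.5))

print(round(optimal, 3), round(subjective, 3))
```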

Subjects in the study by Peterson et al. did not have the theoretical 
sampling distributions available at the time they made post-instruction 
P(H/D) estimates. Pitz and Downing (1967) gave subjects similar instruction 
and, in addition, allowed them to refer to histogram displays of the 
theoretical sampling distributions as they made predictions about which of
two populations was generating the data. However, their predictions were not 
improved by this instruction. Wheeler and Beach (1968) trained subjects
by having them observe samples of eight draws, make a bet on which of two
populations generated the data, and then observe the correct answer. Prior
to training the subjects' sampling distributions were too flat, their betting
responses were conservative, and these two errors were consistent with one
another. After training, the subjects' sampling distributions were more
veridical, their betting responses were less conservative, and again the
two sets of responses were consistent.

A particular kind of misperception error lies in the perception of the
impact of rare events. Vlek (1965) suggested that unlikely events, when they
occur, are seen as uninformative. He argued for the compelling nature of
this error by giving an exaggerated example:

"The posterior probability that a sample of 2004
chips, 1004 of which are red, is taken from bag A
($P_A$ = .70), and not from B ($P_B$ = .30), is equal to
.967. But who will accept hypothesis A as a possible
generator of these data, and, if forced to do so, who
dares to base an important decision of such a small
difference in the -- seemingly biased -- sample?"
(p. 15).

The answer to his plaintive question is, of course, that Bayes' theorem
dares. In the optimal model, it matters not at all that a datum may be
highly unlikely under both hypotheses. The only determinant of its impact
is the relative possibility of its occurrence: the likelihood ratio. The
violation of this likelihood principle has been shown by Vlek (1965) and
Vlek and van der Heijden (1967), who show a systematic decrease in the
Accuracy Ratio as a function of the rarity of the data, and it can serve
as an explanation of Lichtenstein and Feeney's (1968) results. Beach (1968)
directly tested Vlek's idea. Beach constructed decks of cards, each with a
letter, from A to F, written on it in green or red ink. The task of the
subjects was to estimate the posterior probability that the letters sampled were
drawn from the green deck rather than the red deck, given complete information 
about the frequency of each letter in each deck. Two groups of subjects used 
different decks of cards; the likelihood ratios were the same between groups, 
but the relative frequencies of the letters differed between groups. This 
permitted a test of whether the impact of rare events was misperceived, with 
likelihood ratio held constant. The results verified Vlek's hypothesis; 
subjects were more conservative when responding to less likely events. 
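
Vlek's exaggerated example can be checked directly. The sketch below assumes equal prior probabilities for the two bags and independent draws, and works in log odds to avoid numerical underflow with so large a sample; the function name is hypothetical.

```python
import math


def posterior_prob_a(red, other, p_a=0.70, p_b=0.30, prior_a=0.5):
    """Posterior probability of bag A after `red` red chips and `other`
    non-red chips, given P(red) = p_a in bag A and p_b in bag B."""
    log_lr = red * math.log(p_a / p_b) + other * math.log((1 - p_a) / (1 - p_b))
    log_odds = math.log(prior_a / (1 - prior_a)) + log_lr
    return 1 / (1 + math.exp(-log_odds))


rare_sample = posterior_prob_a(1004, 1000)  # 2004 chips, 1004 of them red
tiny_sample = posterior_prob_a(4, 0)        # same excess of red chips

print(round(rare_sample, 3), round(tiny_sample, 3))
```

Both samples yield the same posterior probability because, with these symmetric bags, only the excess of red over non-red chips enters the likelihood ratio. How improbable the whole sample is under either hypothesis has no effect.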

Misaggregation

Another explanation of conservatism is that subjects have great 
difficulty in aggregating or putting together various pieces of information 
to produce a single response. Proponents of this view draw support from 
several sources (Edwards, 1968). First, they point out that in the studies
just reported as supporting the misperception hypothesis, subjects were
shown samples of several data at once. When shown a sample of, say,
6 red and 3 blue chips, and asked to state the probability that such a
sample might occur, the subject must, in a sense, aggregate the separate
impacts of each chip, even though the sample is presented simultaneously.
Viewed in this light, both estimation of P(H/D) and of P(D/H) in studies
like Wheeler and Beach (1968) are aggregation tasks; thus the consistency
between the two tasks does not provide a discrimination between the
misperception and misaggregation hypotheses. Beach (1968), testing the rare event
hypothesis, did present subjects with only one datum at a time, but he
presented three data per sequence. Gettys and Manley (1968) reported two
experiments in which five levels of frequency of data and five levels of
likelihood ratios were factorially combined in 100 binomial problems. For
each problem the subject was shown the contents of two urns and the result
of a single sampling of one datum. In this situation, with no aggregation
required, the rare event effect was not found. The subjects were sensitive
to changes in likelihood ratio but not to differing event frequencies. The
authors argued that the rare event effect found in other studies is
attributable to aggregation difficulties.

A related source of support for the misaggregation hypothesis comes from
the finding that subjects perform best on the first trial of a sequence.
DuCharme and Peterson (1968) reported this finding based on a task using normal
data generators. The subjects were shown samples of heights and asked the
posterior odds that the population being sampled was of men or women. They
were virtually optimal for single-datum sequences and for the first trial
of four-data sequences, but conservative on subsequent trials. This same
result was shown by Peterson and Swensson (1968), using the usual binomial
task and an unusual diffuse-hypothesis task. For the latter, the subjects
were told to imagine an urn containing many thousand red and blue chips.
The proportion of red chips was decided by a random procedure such that all
percentages were equally likely. A defining sample (varying from 1 to 19
chips) was then shown; the subjects were asked to make a point estimate of
the proportion of red chips in the urn. Following each such estimation process
the subjects were asked to imagine a second urn with proportions of chips
just the reverse of the present (unknown) urn. They were then shown an
informational sample (one chip or 5 sequential chips) and asked to give posterior
odds regarding which of the two urns had been sampled. It was these estimates,
as well as the estimates made to the more usual binomial task, that were
very nearly optimal for one datum and the first datum out of five, but more
conservative for data two through five.

It might be noted that Peterson and Miller (1965) found conservatism
with just one datum per problem, but this presents no special problem for the
misaggregation approach, since in that study the one datum had to be aggregated with
a varying value for the prior probability of the hypothesis. In addition, Peterson
and Miller used a probability response mode. This mode, as will be discussed
later, is highly susceptible to a non-optimal but simple strategy which produces
artifactual results. Peterson and Phillips (1966) also found first-trial
conservatism using a probability response mode. DuCharme and Peterson (1968) and
Peterson and Swensson (1968) avoided this criticism by asking for responses in
terms of posterior odds rather than probabilities, and found first-trial optimality.

Finally, man's difficulties in aggregating data have been demonstrated in a
series of man-machine systems studies. A system where men estimate P(D/H)
separately for each datum and the machine combines these into posterior
probabilities via Bayes' theorem has consistently been found superior to a system
where the man, himself, must aggregate the data into a P(H/D) estimate (Edwards,
Phillips, Hays, & Goodman, 1968; Kaplan & Newman, 1966; and Schum, Southard, &
Wombolt, 1969).
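
The division of labor in such a system can be sketched as follows: the human supplies a P(D/H) estimate for each datum under each hypothesis, and the machine performs the aggregation. The sketch assumes the data are conditionally independent given the hypothesis. The function name and the numerical estimates are hypothetical.

```python
def aggregate(priors, likelihoods_per_datum):
    """Combine per-datum P(D/H) estimates into posterior probabilities.

    priors: prior probability of each hypothesis.
    likelihoods_per_datum: for each datum, the estimated P(D/H) under each
    hypothesis, in the same order as `priors`.
    Assumes the data are conditionally independent given the hypothesis."""
    posterior = list(priors)
    for likelihoods in likelihoods_per_datum:
        posterior = [p * like for p, like in zip(posterior, likelihoods)]
        total = sum(posterior)
        posterior = [p / total for p in posterior]  # renormalize after each datum
    return posterior


# Hypothetical judged likelihoods for three chips: red, red, blue.
estimates = [(0.7, 0.3), (0.7, 0.3), (0.3, 0.7)]
result = aggregate([0.5, 0.5], estimates)

print([round(p, 3) for p in result])
```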

Both the misperception and misaggregation hypotheses received support in
a study by Phillips (1966; also reported in Edwards et al., 1968). His subjects 
misperceived the impact of each datum, and, in addition, were not consistent 
with that misperception in a subsequent aggregation task. 

Artifact 

The third explanation of conservatism, that conservatism is
artifactual, was originally suggested by Peterson (see Edwards, 1968), and
has been recently supported and renamed response bias by DuCharme (in press).
DuCharme hypothesized that subjects are capable, and optimal, when dealing
with responses in the odds range from 1:10 to 10:1, but are conservative when
forced, either by the accumulation of many data or by the occurrence of one
enormously diagnostic datum, to go outside that range. He pointed out that
such a response bias would explain many of the conservatism effects reported
in the literature, including increased conservatism attributed to increasing
diagnosticity and the superiority of first-trial performance. DuCharme
tested his hypothesis directly in a task where subjects had to determine
whether observed samples of heights came from a male or female population.
His subjects gave sequential posterior odds estimates to sequences varying in
length from one to seven data. The results supported the response bias
hypothesis. First-trial estimates and later-trial estimates in the same
probability range were similarly optimal. Second- and third-trial estimates
were more conservative following a highly diagnostic first datum (LR = 99)
than were estimates to those same data following an undiagnostic first trial
(LR = 1.3). Optimality of response, across all trials, was marked within a
central range of posterior odds, while conservatism occurred outside this
range.
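
A response bias of this kind can be mimicked with a simple, purely hypothetical response function: reported log odds equal the Bayesian log odds inside the range 1:10 to 10:1, and any excess beyond that range is compressed by a constant factor. The compression factor of 0.5 and the later-trial likelihood ratios of 2 are assumptions for illustration only; they are not taken from DuCharme's study.

```python
import math

LIMIT = math.log(10.0)  # boundary of the assumed "optimal" range, odds of 10:1


def reported_log_odds(true_log_odds, compression=0.5):
    """Hypothetical response function: veridical inside 1:10 to 10:1,
    compressed (conservative) outside that range."""
    magnitude = abs(true_log_odds)
    if magnitude <= LIMIT:
        return true_log_odds
    squeezed = LIMIT + compression * (magnitude - LIMIT)
    return math.copysign(squeezed, true_log_odds)


def reported_sequence(likelihood_ratios):
    """Reported log odds after each datum, starting from even prior odds."""
    true_log_odds = 0.0
    reports = []
    for lr in likelihood_ratios:
        true_log_odds += math.log(lr)
        reports.append(reported_log_odds(true_log_odds))
    return reports


after_diagnostic = reported_sequence([99.0, 2.0, 2.0])
after_undiagnostic = reported_sequence([1.3, 2.0, 2.0])

# Size of the reported revision produced by the same second datum (LR = 2).
step_diag = after_diagnostic[1] - after_diagnostic[0]
step_undiag = after_undiagnostic[1] - after_undiagnostic[0]

print(round(step_diag, 3), round(step_undiag, 3))
```

Under these assumptions the same second datum produces only half the revision when it follows the highly diagnostic first datum. This parallels the direction of the effect described above, without any misperception or misaggregation being built into the sketch.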

Task Determinants in Bayesian Research 



The Effects of Response Mode 

Direct estimation methods. Phillips and Edwards (1966) compared four
different direct estimation modes in a bookbag and poker chip task. The
"probability" response was made by distributing 100 white discs in two
vertical troughs, the height of the discs in each trough indicating the
probabilities of the two hypotheses. The "verbal odds" response was a verbal statement
of the posterior odds after each datum. The "log odds" group estimated
posterior odds by setting a sliding pointer on a scale of odds spaced
logarithmically; the scale ran from 1:1 to 10,000:1. The "log probability" subjects
used a similar sliding scale, labeled in probabilities rather than odds, where
the spacing of the probabilities was determined by converting the probabilities
to odds and scaling the odds logarithmically.
The motivation for using odds and log odds as response alternatives
rested upon an uneasiness with the properties of a probability scale. The
amount of change in posterior probabilities induced by a single datum
decreases as the probabilities prior to the receipt of that datum become
more extreme.

In addition to this problem of nonlinearity between data and response, 
there is a potential problem with ceiling effects because of the boundedness 
of the probability scale at zero and one. The subject may be reluctant, 
in a long task, to give an extreme response early in the sequence, for fear 
of "using up" the scale before the last data arrive. When odds and log odds
are used, however, both these difficulties are avoided; odds bear a constant
multiplicative relationship to binomial data, while log odds are linearly 
related to such data. In addition, neither scale has a ceiling. 
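
The contrast between the scales can be seen in a short sketch. The likelihood ratio (that of one red chip when the bags contain 70% and 30% red chips) and the prior probabilities are hypothetical.

```python
import math


def update_probability(prior_prob, likelihood_ratio):
    """Posterior probability after one datum, via Bayes' theorem in odds form."""
    odds = prior_prob / (1 - prior_prob) * likelihood_ratio
    return odds / (1 + odds)


def log_odds(prob):
    return math.log(prob / (1 - prob))


LR = 0.7 / 0.3  # hypothetical likelihood ratio of a single red chip

# The same datum arriving at a moderate and at an extreme prior probability.
change_from_50 = update_probability(0.50, LR) - 0.50
change_from_95 = update_probability(0.95, LR) - 0.95

# On the log odds scale the revision is the same in both cases.
log_change_from_50 = log_odds(update_probability(0.50, LR)) - log_odds(0.50)
log_change_from_95 = log_odds(update_probability(0.95, LR)) - log_odds(0.95)

print(round(change_from_50, 3), round(change_from_95, 3))
print(round(log_change_from_50, 3), round(log_change_from_95, 3))
```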

The Phillips and Edwards results were consistent with the above
reasoning: for three different bag-compositions and for five different
sequences of 20 chips from each, the "verbal odds" and "log odds" groups
showed the least conservatism.

Indirect methods. Instead of asking the subject for probabilities,
indirect methods infer his probabilities from some other response. Sanders
(reported in Edwards, 1966) used bookbag and poker chip situations to compare
a direct response, verbal odds, with two different indirect responses, choice 
among bets and bidding for bets. He found substantial agreement, as measured 
by similarity of Accuracy Ratios across different diagnosticity levels, 
between the direct, verbal odds mode and the choice among bets response. The 
bidding mode produced considerably more optimal behavior than the other 
two modes. 

Beach and Phillips (1967) compared direct probability estimates with 
probabilities inferred from choices among bets in a situation requiring
subjects to learn the probabilities associated with the flashing of seven
lights. They found that estimated and inferred probabilities correlated .93
(slope = 1.06), averaged across 20 subjects. Strong agreement between
probability estimates and probabilities inferred from bids has also been
found in two studies by Beach and Wise (1969a, c). However, Beach and 
Olson (1967) have shown that probabilities inferred from choices among 
bets were highly susceptible to the gambler’s fallacy (e.g., subjects 
overestimated the probability of a red after four greens were sampled, and 
underestimated it after four reds occurred), while direct estimates of 
probabilities were much more optimal. 

Geller and Pitz (1968) have explored the use of decision speed, measured 
without the subject’s knowledge, in a bookbag and poker chip task. Prior 
to each sampled chip, subjects predicted the color of the chip; after the 
chip was shown, subjects guessed which hypothesis was true (it was this 
decision that stopped the clock); then subjects indicated their certainty 
in the decision by assigning a confidence judgment to the chosen hypothesis. 
These confidence judgments have been shown to be consistently related to 
probability responses (Beach & Wise, 1969b). A high correlation was found
between the speed of decision and the Bayesian probability that the decision 
was correct. In addition, relative changes in decision speed approximated 
optimal changes in probability more closely than did changes in confidence. 

Effects of intermittent responding. Perhaps the very act of making
repeated responses, once after each datum is presented, affects the final 
response of the subject. This hypothesis was tested by Beach and Wise (1969c), 
who compared verbal estimates of posterior probabilities made only at the 
end of a sequence of three data with estimates made after each datum. They 
found satisfactory correspondence between the two estimate methods. Pitz
(1969a), however, using sequences of ten data, did find differences
attributable to repeated responses. Four of his groups gave confidence ratings
after each datum; the groups differed in the degree to which the responses
made on previous trials were available to them. The fifth group gave
confidence ratings only after seeing all ten data. The group that made repeated
responses in a way that was most difficult to remember performed similarly to
the group making only a single, final response, but the results from other
groups with repeated responses showed a non-optimal sequence effect. Halpern 
and Ulehla (1970), using a signal detection task, also found differences 
between repeated responses and a single, final response; the latter more 
closely matched an internal-consistency prediction derived from signal 
detection theory than did the former. 

Nominal vs. probability responses. Another question of interest is
whether there is any difference between a nominal response (yes-no; pre- 
dominantly red-predominantly blue) and a probability response which is 
later converted to a nominal response by the experimenter. Swets and
Birdsall (1967), using an auditory detection task, found that the probability 
response data provided a better fit to the signal detection model than the 
nominal-response data. Similar results were found by Ulehla, Canges, and 
Wackwitz (1967). Unfortunately, Halpern and Ulehla (1970) found exactly the 
opposite results in a visual discrimination task. 

Using a Bayesian task with three hypotheses, Martin and Gettys (1969) 
found better performance using a nominal response than using a probability 
response. Attaching probabilities to two less likely hypotheses as well as 
to the favored hypothesis was apparently difficult enough to degrade 
subjects' performance.

The Effects of Payoffs

The use of payoffs in probability estimation tasks may have a motivational
effect, persuading the subjects to try harder, and an instructional effect,
helping subjects to understand what the experimenter wants from them (Winkler &
Murphy, 1968). These effects were explored by Phillips and Edwards (1966),
who used three different payoff schemes and a control group in a bookbag and 
poker chip task. The subjects estimated the posterior probability of each 
bag for 20 sequences of 20 draws each. The control group received no payoff 
but were told which hypothesis was correct after each sequence. The three 
payoff groups were paid v(p) points, later converted to money, where p
was the subject's estimate for the correct hypothesis, and v(p) was calculated
as follows:

Quadratic: v(p) = 10,000 - 10,000 (1 - p)^2

Logarithmic: v(p) = 10,000 + 5,000 log_{10} p

Linear: v(p) = 10,000 p

The quadratic and log payoffs share the characteristic that the subject 
can expect to win the most points by reporting his true subjective probability 
(Toda, 1963). For the linear payoff, the subject ought always to estimate 
1.0 for the more likely hypothesis.
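
The incentive properties of the three schemes can be checked numerically. The following Python sketch is an illustration only and is not part of the studies reviewed: it uses the payoff functions given above, while the subjective probability q = 0.7, the grid of admissible reports, and the helper names `expected_payoff` and `best_report` are assumptions introduced here.

```python
import math

# Payoff functions v(p) from Phillips and Edwards (1966), as given in the text.
def quadratic(p):
    return 10_000 - 10_000 * (1 - p) ** 2

def logarithmic(p):
    return 10_000 + 5_000 * math.log10(p)

def linear(p):
    return 10_000 * p

def expected_payoff(v, report, q):
    """Expected points for a subject who believes hypothesis A has probability q
    but reports `report` for A (and 1 - report for the alternative)."""
    return q * v(report) + (1 - q) * v(1 - report)

def best_report(v, q, grid):
    """Report on the grid that maximizes expected payoff (hypothetical helper)."""
    return max(grid, key=lambda r: expected_payoff(v, r, q))

# Candidate reports 0.01, 0.02, ..., 0.99 (this avoids log10(0)).
grid = [i / 100 for i in range(1, 100)]
q = 0.7  # illustrative subjective probability, not a value from the study

for name, v in [("quadratic", quadratic),
                ("logarithmic", logarithmic),
                ("linear", linear)]:
    print(name, best_report(v, q, grid))
```

Under these assumptions the quadratic and logarithmic schemes are maximized by reporting the subjective probability itself, whereas the linear scheme is maximized by the most extreme report available on the grid.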

The results indicated that payoffs help to decrease conservatism, but do 
not eliminate it. The log group was better than the quadratic group, which
differed little from the control group. The linear group made many extreme 
estimates (reported probabilities larger than the Bayesian probabilities), 
reflecting a tendency in the direction of the optimal strategy of reporting 
all estimates as 1.0 or 0. The instructional value of payoffs was reflected 
in more learning by the payoff groups than the control groups, and by the 
lower between-subject variance for the payoff groups.

These findings were amplified in a study by Schum, Goldstein, Howell, and
Southard (1967) using a complex multinomial task with six hypotheses and 4, 8,
or 12 data, of varying diagnosticity, in each of 324 sequences. Three payoffs
were used, based on the subject's estimated posterior probability at the end
of each sequence.

Logarithmic: v(p) = 10 + 12.851 log_{10} p

Linear: v(p) = 12p - 2, and

All-or-none: S received 10 points if the hypothesis to which
he assigned the largest posterior probability was
correct; otherwise he received 0 points.

In the all-or-none payoff scheme the size of the posterior probabilities was 
irrelevant, and the payoff provided no instructional feedback to the subject 
regarding the size of his response. Nevertheless, the all-or-none group was 
only slightly inferior to the log group; both groups were conservative 
except with the short (four-item) sequences. The linear group was not, on 
the average, conservative, but the responses were highly variable: the
posterior odds inferred from subjects' responses were as likely to be 50 times
too great or too small as they were to be accurate. When the responses were
simply scored as "correct," meaning that the true hypothesis received the largest
estimated posterior probability, or "incorrect," differences among the
payoff groups were eliminated.

Whereas the studies just described varied payoffs as a function of 
slight differences in probability estimates, Pitz and Downing (1967) studied 
the effects of payoffs on a much grosser level of response — namely binary 
predictions. Subjects were asked to guess which of two specially constructed
dice was being rolled, after five data were presented. Five different payoff 
matrices were used. The first matrix was symmetric, in that rewards and 
penalties were the same for both dice. The other matrices were biased. In
order to maximize their expected winnings, subjects should alter their
strategies when payoff matrices are biased. For example, they should guess
the less likely die when the reward for being correct is great and the cost 
for being wrong is small, relative to the payoffs associated with the other 
prediction. The introduction of biased payoffs is a second way in which this 
study differs from those described above. The subjects were highly optimal 
when using the symmetric matrix. But although they altered their predictions 
as a function of varying payoffs, they did not change nearly enough; 
they were unwilling to make responses which had a smaller probability of being 
correct, even though, because of the biased payoffs, these responses would
have increased their expected gains. Pitz and Downing suggested that subjects 
have a high utility for making a correct guess. A similar suggestion was made 
by Ulehla (1966), who found essentially the same result in a study of percep- 
tual discrimination of lines tilted left or right. With a symmetric payoff 
scheme, subjects closely fit the signal detection model, but biased payoffs 
led to insufficient change in strategy. 
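
The optimal strategy under a biased payoff matrix can be sketched as a comparison of expected gains. The Python fragment below is illustrative only: the posterior probability of 0.65 and the reward and penalty values are hypothetical numbers chosen here, not the matrices used by Pitz and Downing, and `expected_gain` and `optimal_guess` are names introduced for the example.

```python
def expected_gain(p_correct, reward, penalty):
    """Expected winnings from a guess that is correct with probability
    p_correct, winning `reward` if right and losing `penalty` if wrong."""
    return p_correct * reward - (1 - p_correct) * penalty

def optimal_guess(p_a, payoff_a, payoff_b):
    """Return 'A' or 'B', whichever guess has the larger expected gain.
    Each payoff is a (reward, penalty) pair for guessing that die."""
    gain_a = expected_gain(p_a, *payoff_a)
    gain_b = expected_gain(1 - p_a, *payoff_b)
    return "A" if gain_a >= gain_b else "B"

# Hypothetical case: after five data, die A has posterior probability 0.65.
symmetric = optimal_guess(0.65, (10, 10), (10, 10))
biased = optimal_guess(0.65, (10, 10), (30, 2))
print(symmetric, biased)
```

With the symmetric matrix the more likely die is the better guess; with the hypothetical biased matrix the less likely die has the larger expected gain, which is the shift in strategy that subjects made only partially.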

The Effects of Diagnosticity 

One of the simplest ways of varying the diagnosticity of the data in a 
probability estimation task is to change the data generator. In a bookbag and 
poker chip experiment, the diagnostic impact of a sample of one red chip is
greater when the bag being sampled contains 80 red, 20 green or 20 red, 80 
green than when the possible contents of the bag are more similar, say 
60 red, 40 green versus 40 red, 60 green. Several experiments (Peterson, 
DuCharme, & Edwards, 1968; Peterson & Miller, 1965; Phillips & Edwards, 1966;
Pitz, Downing, & Reinhold, 1967; and Vlek, 1965) have manipulated
diagnosticity in this way and all have found greater conservatism with more
diagnostic data. Very low levels of diagnosticity sometimes produce the
opposite of conservatism: subjects' responses are more extreme than Bayes'
theorem specifies (Peterson & Miller, 1965).
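
The difference in diagnostic impact between the two data generators can be made concrete with Bayes' theorem. The following Python sketch is illustrative; it assumes equal prior probabilities for the two bags, and `posterior_after_red` is a name introduced here.

```python
def posterior_after_red(p_red_h1, p_red_h2, prior_h1=0.5):
    """Bayes' theorem for a single red chip: P(H1 | red).
    p_red_h1 and p_red_h2 are the proportions of red chips in the two bags;
    equal priors are assumed unless prior_h1 is given."""
    joint_h1 = prior_h1 * p_red_h1
    joint_h2 = (1 - prior_h1) * p_red_h2
    return joint_h1 / (joint_h1 + joint_h2)

# Bag compositions from the text: 80/20 versus 20/80, and 60/40 versus 40/60.
for p1, p2 in [(0.8, 0.2), (0.6, 0.4)]:
    likelihood_ratio = p1 / p2
    print(likelihood_ratio, posterior_after_red(p1, p2))
```

One red chip carries a likelihood ratio of 4 in the first case and 1.5 in the second, so the optimal revision from equal priors is correspondingly larger for the more dissimilar bags.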

When the data generator is a complex multinomial system, different 
samples or scenarios can differ greatly in total diagnosticity, i.e., in the 
certainty with which the sample points to one of several hypotheses. Studies 
by Martin and Gettys (1969), Phillips, Hays, and Edwards (1966), and Schum, 
Southard, and Wombolt (1969), all showed that scenarios of higher overall 
diagnosticity lead to greater conservatism. Martin and Gettys found that 
their least diagnostic scenarios produced the same extremeness of response 
(opposite of conservatism) as found by Peterson and Miller (1965) in a binomial 
task. 

Another way of varying diagnosticity is to vary sample size. In general, 
the larger the sample, the more diagnostic it is. Pitz, Downing, and Reinhold
(1967), and Peterson, DuCharme and Edwards (1968), using binomial tasks, and
Schum (1966b, also Schum, Southard, & Wombolt, 1969) using a multinomial
task, have shown that larger sample sizes yield greater conservatism. Diagnos- 
ticity can be held constant across different sample sizes, however. In any 
binomial task, diagnosticity is solely a function of the difference between 
the number of occurrences of one type and of the other type. Thus the
occurrence of 4 reds and 2 blues in a sample of 6 chips has the same
diagnosticity as the occurrence of 12 reds and 10 blues in a sample of 22
chips. Studies by Vlek (1965) and Pitz (1967) show that when this difference
is held constant, the larger sample sizes yield lower posterior estimates,
hence greater conservatism. However, when Schum, Southard, and Wombolt (1969)
held diagnosticity constant in a multinomial task, variations in sample length
had no effect upon the size of subjects' final posterior probability
estimates. The method used for holding sample diagnosticity constant as
sample size increases differs
in the binomial and multinomial task. In the binomial task this is effected 
by including several data with equal but opposite diagnostic values (red and 
blue chips) which cancel each other out. But in Schum’s task with six 
hypotheses and data of varied diagnosticity, total diagnosticity can be 
held constant only by using individual data in the 18-data sample that are each,
on the average, of low diagnosticity compared to the data used in the 6-data
sample. Since data of low diagnosticity have been shown to produce less 
conservatism, this may account for the discrepancy between Schum's findings 
of no sample-size effect and the finding of large effects by Vlek (1965) and 
Pitz (1967). 
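
The claim that binomial diagnosticity depends only on the difference between the two counts can be verified directly. The sketch below is illustrative: the bag composition p = 0.6 and the equal prior odds are assumptions made here, and `posterior_odds` is a hypothetical helper.

```python
def posterior_odds(reds, blues, p=0.6, prior_odds=1.0):
    """Posterior odds favoring the predominantly red bag in a symmetric
    binomial task: P(red) = p under H_red and 1 - p under H_blue."""
    like_red = p ** reds * (1 - p) ** blues
    like_blue = (1 - p) ** reds * p ** blues
    return prior_odds * like_red / like_blue

small = posterior_odds(4, 2)     # 4 reds and 2 blues
large = posterior_odds(12, 10)   # 12 reds and 10 blues
print(small, large)
```

Both samples yield the same posterior odds, (p / (1 - p)) raised to the power of the difference of 2, because each red-blue pair contributes likelihood ratios that cancel.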

Sample size and diagnosticity can also be varied by holding the total 
number of data constant and varying the number of data presented to the subject 
at any one time. Peterson, Schneider, and Miller (1965) presented subjects 
with 48 trials of one datum each, with 12 trials of 4 data each, with 4 trials 
of 12 data each, and with a single trial containing 48 data. Conservatism
was large when subjects responded after each single datum, but was even 
larger when the number of data (and hence the average diagnosticity) per 
trial increased. Vlek (1965) also found poorer performance with simultaneous 
than with successive presentation of data. 

All these studies tell the same story: increased diagnosticity, no
matter how produced, increases conservatism. The sole exception to this
statement is reported by Schum and Martin (1968), who used a multinomial
task with six hypotheses and six data per scenario. They used two different
data-generating models, Model A and Model B. Both models were simpler than
the multinomial models used in other research in that every possible datum 
favored just one hypothesis, with the five other hypotheses being equally 
less likely. The impact of a single datum upon the hypothesis favored by
that datum was always the same within a model, but differed between models.
Each datum from Model A was more diagnostic than each datum from Model B. 

Any scenario of six data could be characterized by the number of data
favoring the most-favored hypothesis. At one extreme, each datum could
favor a different hypothesis; then the posterior odds between any pair of
hypotheses was 1.0. At the other extreme, all six data could favor the same
hypothesis; this would be the most diagnostic case. There were nine other
cases of intermediate diagnosticity.

Each subject gave posterior probability responses to 264 scenarios 
(six data presented simultaneously); there were 12 different scenarios for
each of the 11 diagnosticity cases for each of the 2 models. The appropriate 
conditional probability matrix (either Model A or Model B) was always 
displayed to subjects. Subjective log likelihood ratios were computed from 
the probability estimates and compared with Bayesian log likelihood ratios. 

The results from Model A were typical of the diagnosticity studies previously 
mentioned — subjects were sensitive to changes in the diagnosticity, but 
as diagnosticity increased, subjects became increasingly conservative. 

The results from Model B scenarios represented a unique finding — as 
diagnosticity increased in Model B scenarios, extremity of response increased. 
Seven of the eight subjects showed this effect; the other subject was 
slightly more variable but otherwise nearly optimal. This finding is unex- 
plained by Schum and Martin. One possible explanation is that subjects 
completely disregarded the difference between Model A and Model B, responding 
solely to the number of items favoring the most likely hypothesis. The 
subsequent comparison of such responses with the optimal responses derived 
from the two different models would make similar strategies look conservative 
in one case and extreme in the other case. 
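
This possible explanation can be illustrated with a small Python sketch. Everything numerical in it is hypothetical: the probabilities 0.50 and 0.25 standing in for Models A and B, and the fixed per-datum likelihood ratio of 3 attributed to the subject, are assumptions chosen only to show the direction of the effect, not values from Schum and Martin.

```python
import math

def bayesian_llr(k, a):
    """Log likelihood ratio between a hypothesis favored by k data and one
    favored by none, when each datum has probability a under the hypothesis
    it favors and (1 - a) / 5 under each of the five others."""
    return k * math.log(a / ((1 - a) / 5))

def subjective_llr(k, per_datum_lr=3.0):
    """A subject who ignores the model and responds only to the number of
    items favoring the leading hypothesis (hypothetical strategy)."""
    return k * math.log(per_datum_lr)

MODEL_A, MODEL_B = 0.50, 0.25   # hypothetical; A is more diagnostic than B

ratios_a = [subjective_llr(k) / bayesian_llr(k, MODEL_A) for k in range(1, 7)]
ratios_b = [subjective_llr(k) / bayesian_llr(k, MODEL_B) for k in range(1, 7)]
print(ratios_a[0], ratios_b[0])
```

The same response strategy gives a ratio of subjective to Bayesian log likelihood ratio below 1 when scored against the more diagnostic model (apparent conservatism) and above 1 when scored against the less diagnostic model (apparent extremeness).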

The Effects of Manipulating Prior Probabilities

The results of several studies in which the prior probabilities were 
systematically varied are mixed. Three studies found no systematic or 
significant effect of this variable upon subjects' responses: Phillips and
Edwards (1966) reported no effect on Accuracy Ratio attributable to five
different prior probability levels in a bookbag and poker chip experiment.
Phillips, Hays, and Edwards (1966), and Schum (1966b) reported no effects
due to change in priors in multinomial tasks.

Strub (1969), using a binomial task, observed that subjects' terminal
posterior probability estimates after 100 data were higher for priors of
.90 - .10 than for priors of .50 - .50, but he did not report his data in
sufficient detail to determine whether one condition produced more optimal 
behavior than the other. 

Peterson and Miller (1965), recognizing that the place to look for the
effect of priors is right at the beginning of data accumulation, rolled one
of two dice just once for every "sequence." They used nine levels of prior
probabilities, from .1 to .9, and found that subjects' Accuracy Ratios
increased (became less conservative) as the priors became more extreme
(departed from .5). This clear-cut finding, however, may be an artifactual
result of the response mode: probabilities expressed with a sliding pointer
on an equal-interval scale. If subjects simply moved the slider a constant
amount, up for a black datum, down for a white datum, regardless of its
initial setting, the reported relationship between the Accuracy Ratio and
prior probabilities would occur.
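
The artifact can be demonstrated with a short computation. The Python sketch below is illustrative and rests on stated assumptions: the Accuracy Ratio is taken here to be the log likelihood ratio inferred from the subject's response divided by the Bayesian log likelihood ratio, and the slider shift of 0.05 and the datum's likelihood ratio of 2:1 are hypothetical values.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))

def accuracy_ratio(prior, shift, bayes_llr):
    """Inferred log likelihood ratio divided by the Bayesian one, for a
    subject who moves the slider from `prior` to `prior + shift`."""
    inferred_llr = logit(prior + shift) - logit(prior)
    return inferred_llr / bayes_llr

SHIFT = 0.05                 # hypothetical constant slider movement
BAYES_LLR = math.log(2.0)    # hypothetical 2:1 likelihood ratio for the datum

priors = [i / 10 for i in range(1, 10)]   # .1, .2, ..., .9 as in the study
ratios = [accuracy_ratio(p, SHIFT, BAYES_LLR) for p in priors]
for p, r in zip(priors, ratios):
    print(p, round(r, 2))
```

Because a fixed step on an equal-interval probability scale corresponds to a larger change in log odds near the ends of the scale, the computed ratio is smallest at a prior of .5 and rises toward both extremes even though the subject's behavior never changes.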

The one general characteristic of the Bayesian research summarized so 
far is that subjects are never as sensitive to the experimental conditions 
as they ought to be. This statement characterizes conservatism itself, as
well as the effects of payoffs and diagnosticity. The above findings
regarding priors are too inconclusive to fit in this mold, but exactly this
result of varying priors has been found using signal detection models by 
Ulehla (1966) and Galanter and Holman (1967). Wendt (1969) also found 
partial sensitivity to priors. He asked his subjects to bid for each datum; 
this bid was interpreted as the value of the datum for the subject. Wendt 
found that the bids were closer to optimal when the prior odds were 1:1 than 
when the priors were extreme. 

The Effects of Sequence Length 

Several studies have found that subjects are more hesitant to commit 
themselves fully to a probability revision when they know that there will 
be opportunity for additional revision on later trials than when they know 
any revision taking place must be made immediately. In one, Vlek (1965)
compared P(H/D) estimates made after the ninth trial in a 19-trial sequence 
with estimates made after the simultaneous presentation of nine data items 
(no more were to be presented). The probability estimates were less extreme 
in the former condition where subjects knew they had ten additional oppor- 
tunities for revisions. This effect might be attributed to the difference 
between simultaneous vs. serial presentation in the above study. However,
Pitz, Downing, and Reinhold (1967) used serial presentation with responding
after each item and found the average revision of P(H/D) to be greater for
shorter sequences than for longer ones. Similarly, Shanteau (1969)
found that shorter sequences produced more extreme P(H/D) responses at any
serial position, holding the evidence constant. Although none of the above
studies put any pressure on subjects to make their intermediate responses
maximally accurate, Roby (1967) used a payoff system to motivate subjects
to be accurate at every response point and he, too, found that they tended
to delay for several trials before modifying their estimates.

Sequential Determinants of Information Use

More often than not information is presented to a judge or decision 
maker in sequential order. In some cases, the decision is evaluated after 
each new item of information. In others the decision maker must wait until
all the information has been presented before responding. In both cases it 
is of interest to determine whether the way that information is used depends 
upon the order in which it appears in the sequence. 

Primacy and Recency Effects 

Without a doubt the most thoroughly investigated type of sequential 
effect aims to determine whether information presented early in a sequence 
is more or less influential than information presented later, other things 
being equal. Greater influence of early information is called primacy. Its 
opposite effect is called recency. The issue seems to have been studied 
first by Lund (1925) who presented evidence supporting a "Law of Primacy in 
Persuasion" but the modern impetus can be attributed to the work of Asch 
(1946). Asch had subjects judge the favorableness of a person described by 
six adjectives. When these adjectives were read in decreasing order of 
favorableness (i.e., intelligent, industrious, forceful, critical, stubborn,
envious) the final impression was more favorable than when the reverse order 
was used, indicating a primacy effect. Asch hypothesized that the initial 
adjectives set up a "directed impression" that caused the later adjectives 
to shift their meanings to conform to the existing set. This research 
stimulated a body of research on order effects in persuasion, summarized in 
a book edited by Hovland (1957). Even by this early date it was evident
that there was no completely general principle of primacy and recent work
has borne this out, focusing instead on delineating some of the situational
factors influencing order effects.

Table 3 presents an overview of more than two dozen studies of primacy
and recency. We have grouped these studies into three categories according 
to the type of stimulus information with which the judge had to deal. The 
first group of studies involves verbal items, such as adjectives to be
integrated into an overall impression of a person, foods descriptive of a 
meal, headlines descriptive of a paper, etc. In the second class of studies, 
subjects were presented with simple quantitative or perceptual inputs, either 
numbers, weights, or lengths of lines, which had to be averaged. Group III 
consists of studies where the subject had to make probability estimates or 
predictions about the true state of the world on the basis of sample data. 



Insert Table 3 about here 



Within each major class, the studies have been further subdivided into 
two categories, depending on whether the judgment was made only after the 
final item of the information sequence (coded F) as opposed to being made 
after each item of information or after several but not all of the data were 
received. These latter two conditions have been coded I, for intermittent. 

Category I: Verbal information. Studies involving verbal items of
information have typically employed some version of the following design to 
assess order effects. First the items are scaled individually with respect
to the criterion. These sets are then presented in ascending scale order and
vice versa, as in the Asch study. A related procedure first sorts items into 
homogeneous subsets having high (H) or low (L) scale values. Then blocks of 
H and L items are presented in varying order. For example, primacy would 
lead the final judgment for a HHLL sequence to be higher than that for a
LLHH sequence. Recency would produce the opposite effect. Another related
design was employed by Anderson (1965b) who interpolated a block of LLL or 
HHH items into a sequence containing 3 or 6 other items, all with opposite 
scale values. For example, in the sequences: 

1. LLLHHH 

2. HLLLHH 

3. HHLLLH

4. HHHLLL 

primacy would result in increasing judgment as one proceeded from sequence 
1 to 4 and the LLL block moved towards the latter part of the sequence. 

The results of studies in Category I can be summarised as follows. When 
only one judgment is made, at the end of the sequence, ten studies reported 
primacy effects, three observed recency, and two found no effect. One study 
showing recency, Anderson and Hubert (1963), required subjects to recall the 
adjectives just after making their rating. When recall was not required, 
primacy occurred. These results were interpreted as indicating that primacy 
was caused by decreased attention to the later adjectives. Recall presumably 
eliminated primacy by forcing attention to the later adjectives. Anderson 
(1968a) also found recency in a study that departed slightly from the typical 
design. Instead of using high and low items in the same sequence, Anderson
used moderately high (M+) and high (H) or moderately low (M-) and low (L) 
items, thus reducing the glaring inconsistencies that usually occur when H 
and L items are included together. The fact that primacy was not obtained 
here led Anderson to propose that its existence in the other studies was
due to subjects discounting the later, inconsistent evidence, much as occurred
in the Anderson and Jacobson (1965) study of stimulus inconsistency. Hendrick 
and Constantini (1970a) examined both the attention decrement and inconsis- 
tency explanations of primacy. They varied the degree of perceived
inconsistency among sets of three high- and three low-valued adjectives. They
also required half of their subjects to repeat each adjective after it was 
presented. Variation of inconsistency was found to have no effect. A 
strong primacy effect was observed except in the situation where subjects 
had to pronounce the adjectives. There, recency occurred. The authors 
concluded that these results supported the attention decrement hypothesis 
and they argued that the recency found in the Anderson (1968a) study 
occurred not because of the low degree of inconsistency in that study but 
because Anderson’s subjects also pronounced the adjectives. 

A study by Anderson (1965b) indicated linear primacy effects. The 
earlier the information was presented in the sequence, the greater its effect,
increasing by a constant amount per serial position. Anderson proposed a weighted average
model to account for the data, where the weights declined as a linear function 
of ordinal position in the set. 
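
A weighted average with linearly declining weights can be sketched as follows. The Python fragment is illustrative only: the scale values assigned to H and L items and the weights 6 through 1 are hypothetical, chosen here to show the pattern rather than taken from Anderson's data. The four sequences are those of the interpolated-block design described earlier.

```python
def weighted_average(sequence, weights, values):
    """Judgment predicted by a weighted average of item scale values,
    with one weight per serial position."""
    total = sum(w * values[item] for w, item in zip(weights, sequence))
    return total / sum(weights)

VALUES = {"H": 1.0, "L": 0.0}      # hypothetical scale values
WEIGHTS = [6, 5, 4, 3, 2, 1]       # hypothetical linearly declining weights
SEQUENCES = ["LLLHHH", "HLLLHH", "HHLLLH", "HHHLLL"]

judgments = [weighted_average(s, WEIGHTS, VALUES) for s in SEQUENCES]
steps = [b - a for a, b in zip(judgments, judgments[1:])]
print(judgments, steps)
```

With linear weights, each move of the LLL block one position later raises the predicted judgment by the same amount, which is the linear primacy pattern.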

Although Anderson and Norman (1964) argued that primacy seems unlikely
to stem from a change in meaning effect, later studies (Anderson, 1966;
Anderson & Lampel, 1965; Wyer & Watson, 1969) have shown that such contextual
effects do occur and Chalmers (1969) proposed that change of meaning be 
incorporated formally into Anderson’s weighted average model. 

Turning to the studies in Category I, where subjects responded inter- 
mittently as stimulus information was acquired, a radically different picture 
emerges: recency strongly predominates. It is not clear why making judgments
during the sequence should lead to recency whereas making only one
judgment at the end of the sequence generally produces primacy. Luchins 
(1958) presented subjects with two blocks of highly inconsistent information 
about an individual, in paragraph form. They filled out two detailed 
questionnaires about the subject of the paragraph, one after each block. 
Luchins argued that the inconsistencies were accentuated by the first
questionnaire, making it difficult for subjects to assimilate the later
information and causing them to respond to the second questionnaire in terms
of the second block of information; hence a recency effect. Why this clearly
inconsistent information was not discounted, however, as seemed to occur in
an earlier study (Luchins, 1957) where only final responses were given, is
an unanswered question. Luchins (1958) observed that his subjects did not
regard themselves as committed to the opinions they had expressed on the first
questionnaire, often giving diametrically opposite answers on the second
administration.

Stewart (1965) argued that responding after each new adjective forces 
subjects to pay equal attention to each one and to weight the new information 
and the old impression equally. Although equal attention and equal weighting 
might seem to predict neither primacy nor recency, Anderson (1959) showed that 
equal weighting of new information and an old impression, in a situation 
where there is an initial impression prior to seeing the new information, will 
necessarily produce recency. 
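
The reason equal weighting produces recency can be seen by tracing the weight each item ends up carrying in the final judgment. The Python sketch below is an illustration under stated assumptions: the update rule, in which each new item and the current impression are averaged with weight 0.5 each, is a simple reading of "equal weighting," and `effective_weights` is a name introduced here.

```python
def effective_weights(n_items, new_weight=0.5):
    """Weight carried in the final judgment by the initial impression and by
    each item, when every update is
        impression = (1 - new_weight) * impression + new_weight * item.
    Returns (weight_of_initial_impression, [weights of items 1..n])."""
    initial = (1 - new_weight) ** n_items
    items = [new_weight * (1 - new_weight) ** (n_items - k)
             for k in range(1, n_items + 1)]
    return initial, items

initial, items = effective_weights(4)
print(initial, items)
```

Each successive update halves the contribution of everything that came before, so the last item carries the largest weight in the final judgment and the first item the smallest.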

The study by Anderson (1959) deserves special mention because it ex- 
plicitly tested for order effects across a long sequence (16 items) of 
relatively complex arguments in a trial setting. A recency effect was ob- 
served early but decayed and was replaced by a primacy effect later in the 
sequence. Anderson hypothesized that opinion is made up of two parts, a 
superficial component that is quite labile and produces recency, and a basal 
component which forms slowly and is then relatively little influenced by new 
communications. This resistance to change results in a primacy effect. 

Category II: Numbers, weights, and lines. Stimuli used by studies in
this category are described by information that is unlikely to produce
strong feelings of incongruity in subjects and in this way they differ from
those in Categories I and III. The six studies in which subjects averaged
numbers, weights, or lines have typically employed factorial designs to 
assess the effects of serial position. 

With only one exception, the studies in this category found recency 
effects. Whether or not judgments were made intermittently seemed to make 
little difference. The exception is a study by Hendrick and Constantini 
(1970b) that obtained primacy in number averaging when subjects responded 
only after the final stimulus. In this respect, number averaging is similar 
to the integration of verbal items (Category I). 

Anderson (1967a) noted that contrast effects might be one cause of the 
pervasive recency phenomenon for lifted weights. For example, if weight L 
is felt lighter in sequence HHL than in LHH, the data would show a recency
effect.

Weiss and Anderson (1969) hypothesized that memory and storage require- 
ments might determine the recency they found in intermittent judgments of 
average length of lines. To carry this idea further, it may be that subjects 
do not preserve the individual memories of previous lines, numbers, or weights 
when responding intermittently. These may tend to lose their identity when 
integrated into an earlier impression. When the time comes to integrate 
another item into the impression, subjects give the new item more nearly 
equal weight rather than weighting it by the reciprocal of n, the number of 
items in the sequence. In other words, if the subject does not keep in mind the 
number of previous items, his subjective perception of n may undergo temporal 
decay. Although Weiss and Anderson conducted one test of this hypothesis 
without obtaining substantiating results, it would appear to merit further
consideration. Weiss and Anderson did find less recency when judgments were 
made only at the end of a sequence. 
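
The contrast between weighting by the reciprocal of n and weighting by a decayed perception of n can be sketched numerically. The Python fragment below is illustrative: the line lengths are hypothetical, and capping the perceived n at 3 is only one arbitrary way of representing the supposed decay, not a model proposed by Weiss and Anderson.

```python
def running_judgment(items, perceived_n):
    """Integrate items one at a time, giving item k the weight
    1 / perceived_n(k) and the prior impression the remainder."""
    impression = 0.0
    for k, item in enumerate(items, start=1):
        impression += (item - impression) / perceived_n(k)
    return impression

ascending = [10, 10, 10, 20, 20, 20]    # hypothetical line lengths
descending = list(reversed(ascending))

def true_n(k):
    return k              # weight 1/n reproduces the arithmetic mean

def decayed_n(k):
    return min(k, 3)      # hypothetical cap on the perceived number of items

exact = running_judgment(ascending, true_n)
late_long = running_judgment(ascending, decayed_n)
late_short = running_judgment(descending, decayed_n)
print(exact, late_long, late_short)
```

With the true n the judgment equals the mean regardless of order; with the capped n the judgment is pulled toward whichever items came last, which is a recency effect.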

Category III: Probabilistic information. Studies investigating sequential
use of probabilistic information have generally required subjects to make
judgments after each new datum is presented. Five such studies have reported 
primacy effects and two have obtained recency. The dominance of primacy here 
contrasts with the recency effects generally found when subjects make inter- 
mittent judgments upon receipt of verbal or numerical information. Why? One 
possible explanation is the fact that the studies by Peterson and DuCharme 
(1967), Roby (1967), and Dale (1968), each presented subjects with a long 
sequence of items of information that first pointed strongly to one hypothesis
and then suddenly changed in character so that the less favored hypothesis 
became at least as probable as the first. The resulting inconsistency of the 
latter data is extremely implausible in a stationary environment and it is 
not surprising that subjects tended to discount those data. Neither of the 
two studies obtaining recency effects used such strongly inconsistent data 
sequences. 

Summary of primacy-recency studies. It appears that order effects are
a highly pervasive phenomenon, appearing in studies employing quite diverse
stimuli and response modes. Whether recency or primacy occurs seems very
much dependent upon the task characteristics. Primacy is usually found
when the subject responds only at the end of the sequence and the later
information is highly incongruent with the earlier data. When recall or
pronunciation of the stimuli is required or when judgments are made during
the sequence itself, recency predominates. However, when strong commitments
have been developed on the basis of early information and the recent
information is extremely implausible (Category III) even intermittent responding
produces primacy. When the information is homogeneous in nature and not
likely to create feelings of incongruency (Category II), recency is observed.
Although many hypotheses have been proposed to account for these data, their
causes remain to be precisely determined.



An Inertia Effect in Bayesian Research 

The property of inertia was attributed to opinions by Anderson (1959) 
in the course of discussing the concept of a "basal component" — that part 
of an opinion which becomes increasingly resistant to change as information 
accumulates. More recently Pitz and his associates have conducted a series 
of studies demonstrating the existence of "inertia" in studies where opinions 
are formed and revised on the basis of probabilistic evidence. 

Pitz, Downing, and Reinhold (1967) found that subjects revised their 
P(H/D) estimates much less following evidence contradictory to their 
currently favored hypotheses than they did after confirming evidence. Revision
should have been equal in either direction. Especially interesting
was the finding that probability estimates sometimes moved towards greater 
certainty after a single disconfirming datum was observed. This phenomenon 
was labeled an "inertia effect" and was also found by Geller and Pitz (1968). 

Geller and Pitz investigated two possible explanations of the inertia 
effect. The first says that inertia stems from strong commitment to a
hypothesis whereby subjects become unwilling to change their stated level 
of confidence even though their opinions might change. This hypothesis was 
suggested by findings in studies by Gibson and Nichol (1964), Brody (1965), 
and Pruitt (1961). Pruitt found that subjects required more information 
to change their minds about a previous decision than to arrive at that 
decision in the first place. Brody found that initial committment to an 
incorrect decision slowed down the rate of increase in confidence for the 
correct choice. Geller and Pitz obtained data indicating that subjects 1 
speed of decision decreased markedly following disconfirming evidence even 
though the stated confidence in that decision had not decreased. They 
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argued that this supported the committment hypothesis and also concluded that 
stated confidence may not indicate the subject’s true opinions. A second 
hypothesis tested by Geller and Pitz was that subjects may expect an occasional 
disconfirming event to occur when information is probabilistic. For example, 
if the task is to determine whether the samples of marbles are coming from an 
urn that is 60% red and 40% blue or vice versa and the first 9 draws produce 
6 red and 3 blue marbles, the drawing of a blue on the next trial may not be 
upsetting to subjects who believe the urn to contain 40% blue marbles. When 
subjects were asked to predict the next event in the sample, Geller and Pitz 
found that the inertia effect was greater following predicted disconfirming 
events than non-predicted disconfirming events and this \?as taken as support 
for the second hypothesis. 
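
The urn example can be worked through numerically. The following is a
minimal sketch, assuming equal prior probabilities for the two urns and
independent draws (neither assumption is stated in the text); the function
name posterior_red_urn is ours and is used only for illustration.

```python
def posterior_red_urn(n_red, n_blue, p_red=0.6, prior=0.5):
    """P(urn is predominantly red | sample), by Bayes' theorem.

    Assumes two symmetric urns (p_red vs. 1 - p_red red marbles),
    independent draws, and prior probability `prior` for the red urn.
    """
    like_red = p_red ** n_red * (1 - p_red) ** n_blue
    like_blue = (1 - p_red) ** n_red * p_red ** n_blue
    return prior * like_red / (prior * like_red + (1 - prior) * like_blue)


before = posterior_red_urn(6, 3)  # after 6 red and 3 blue draws
after = posterior_red_urn(6, 4)   # one further, disconfirming, blue draw
print(round(before, 3), round(after, 3))
```

Under these assumptions the Bayesian posterior for the predominantly red
urn falls from about .77 to about .69 after the blue draw, so a response
that stays put or moves towards greater certainty departs from the model
even if the subject regards the blue marble as an expected event.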

Further evidence for the commitment hypothesis comes from a study by
Pitz (1967). His subjects stated their confidence in their opinions only
after an entire sample was presented. When confidence was plotted as a
function of increasing sample size, with Bayesian probabilities held
constant, mean confidence judgments decreased rather than increasing as
would be predicted from the inertia effect. This lack of inertia was
attributed to the fact that there was no prior judgment to which subjects
would have been committed. A later study (Pitz, 1969a) found that when
subjects were not allowed to keep track of their trial-by-trial responses,
inertia was eliminated.

Pitz (1966) had subjects make sequential judgments of the proportion of
particular events in a sample. When subjects' previous judgments were
displayed to them or could be recalled, their estimates showed a delay in
revision towards .5 that seems analogous to the inertia effect found in
studies of confidence or subjective probability. Here, too, a group whose
previous judgments were not displayed showed no such effect.






The inertia effect can be thought of as a type of primacy effect. The
fact that inertia is so dependent upon the degree to which subjects'
previous judgments are displayed or otherwise highlighted suggests that
this same factor might also be operating in some of the primacy-recency
studies discussed above. It is perhaps relevant that most of the studies
in Category I (see Table 3) that employed intermittent responding and
obtained recency effects used spoken ratings, slash marks, or required
subjects to fill out detailed questionnaires. None of these formats gives
particular salience to previous judgments. The one study that exhibited
primacy effects (Anderson, 1959) employed a more standard written response,
although subjects did have to turn the page for each new item of
information. In addition, each of the studies in Category III that obtained
primacy (Dale, 1968; Peterson & DuCharme, 1967; Roby, 1967) had subjects
make estimates on some mechanical device that preserved the previous
response and required it to be physically manipulated when changes were
made. All this is obviously post hoc analysis, but it seems to indicate
that future research on primacy and recency should take a close look at
the effect of the way in which the previous response in the sequence is
made and stored.

Learning to Use Information 

There has been considerable investigation into the learning of infor- 
mation processing and judgmental skills. Our focus here will be on studies 
in which the subject has to learn to use information to make a prediction or 
judgment. We shall neglect a rather sizable literature that explores whether 
subjects can learn to detect correlational or probabilistic contingencies 
among events but does not require that this knowledge be used in decisions.

Regression Studies of Learning

Researchers working within the regression framework and, in particular, 
with the lens model, have been quite interested in learning (see, for example, 
the chapters by Naylor and Bjorkman in this book). In fact, learning could
be categorized, along with the problem of modeling, as a focal topic within 
the correlational paradigm. One way to partition the studies that have been 
conducted is according to whether subjects had available only one cue or 
multiple cues in their learning task. 

Single-cue learning. Research with single cues has focussed upon what
Carroll (1963) called "functional learning." Carroll attempted to discover
whether subjects could learn the functional relationships between a scaled
cue or stimulus variable, X, and a scaled criterion, Y. The environment was
deterministic; i.e., there was a perfect 1-1 correspondence between X and
Y. Across tasks, Carroll varied the mathematical complexity of the functions
as determined by the number of parameters needed to describe them. He found
that subjects' responses seemed to follow continuous subjective functions,
even when the stimuli and criterion feedback were randomly ordered. Not
surprisingly, simple functions were learned best. Later work by Bjorkman
(1965) and Naylor and Clark (1968) centered around the relative ease of
learning positive vs. negative linear functions both in deterministic and
probabilistic settings. In the latter studies the degree of predictability
was manipulated and described in terms of the absolute magnitude of r_e
(note that, in single-cue studies, r_{i,e} is equivalent to r_e; similarly,
r_{i,s} equals r_s, and b_{i,e} and b_{i,s} equal b_e and b_s,
respectively). The results of these studies indicated that positive
relationships between cue and criterion are learned much more readily than
negative ones.

Bjorkman (1968) was interested in what he called "correlation learning,"
defined as functional learning where error (r_e < 1.00) was involved. He
observed that correlational learning requires a subject to learn both a
function and the probability distributions around the regression curve. In
one experiment he found that the variance of subjects' responses about
their own regression curves decreased as a consequence of training. A
second experiment varied the extent to which there was a definite function
to learn. Conditions with less pronounced cue-criterion trends resulted in
larger ratios of subjects' response variance to criterion variance. From
these results, Bjorkman concluded that correlational tasks are learned
through a two-stage process involving both functional learning and
probability learning, with the former being temporally prior to the latter.

Conservatism in single-cue learning. A particularly interesting issue
in single-cue learning is whether subjects in these studies exhibit
conservatism such as is evidenced in Bayesian studies of performance.
Several results have been brought to bear upon this matter, but they must
be viewed cautiously because of the problems of assessing conservatism in
correlational tasks. For example, Naylor and Clark (1968) measured
conservatism by dividing the stimulus distribution into thirds and
computing the variance of each subject's responses within each third of the
range. These variances were compared with the variances of the criterion
values computed over the same sub-ranges. The assumptions underlying this
measure are (a) that the criterion distribution reflects the true
probabilities of the various hypothesis states within each sub-range of cue
values and (b) that a subject's distribution of point responses represents
an adequate picture of his perceived subjective probabilities for each of
these hypothesis states. Given these assumptions, Naylor and Clark's
subjects were conservative inasmuch as the average dispersion of their
judgments was found to exceed the dispersion of the criterion values,
particularly in the upper and lower thirds of the cue distribution and for
high values of r_e.

Naylor and Clark also proposed that the standard error of estimate,
sqrt(1 - r_s^2), could be taken as an index of conservatism. Conservatism
was presumed to increase this index, leading subjects to scatter their
responses rather than consistently predicting the same criterion value,
given a particular cue value. By this measure, Naylor and Clark's subjects,
as well as subjects in studies by Bjorkman (1965), Gray (1968), Gray,
Barnes, and Wilkinson (1965), and Schenck and Naylor (1965), were not
conservative. In these studies, r_s typically exceeded r_e and the
discrepancy, (r_s - r_e), was inversely related to r_e. Thus the two
measures proposed by Naylor and Clark lead to opposite conclusions about
conservatism.

Brehmer and Lindberg (in press) have criticized the above conclusions,
arguing that conservatism really means that subjects do not change their
inferences as much as they should when the cue values change. They argued
that the indices used by Naylor and Clark confound two sources of variance:
the consistency of the subjects and their conservatism or extremeness.
Therefore, Brehmer and Lindberg proposed that conservatism be assessed by
the relationship between b_e and b_s, the slopes of the regression lines
relating the criterion values and the judgments, respectively, to the cue
dimension.
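
The slope comparison can be illustrated with a short computation. This is
a minimal sketch with invented data; the helper slope_and_r and the three
lists are hypothetical and are not taken from any of the studies cited. It
fits least-squares lines of criterion on cue (slope b_e) and of judgment on
cue (slope b_s), and labels the subject conservative when b_s < b_e.

```python
def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Invented illustrative data: cue values, criterion values, and one
# hypothetical subject's judgments on the same six trials.
cue = [1, 2, 3, 4, 5, 6]
criterion = [2, 1, 4, 3, 6, 5]
judgment = [0, 2, 3, 6, 7, 9]

b_e, r_e = slope_and_r(cue, criterion)   # environment side
b_s, r_s = slope_and_r(cue, judgment)    # subject side
label = "conservative" if b_s < b_e else "not conservative"
print(round(b_e, 3), round(b_s, 3), label)
```

Because the slope and the correlation are computed separately, a subject
can be both inconsistent (low r_s) and extreme (b_s greater than b_e),
which is the distinction the slope index is meant to preserve.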

The experiments by Gray (1968), Gray et al. (1965), and Naylor and Clark
(1968) found that b_s exceeded b_e for low values of r_e (and b_e) but not
for high values. Since r_e and b_e were confounded in these studies,
Brehmer and Lindberg decided to vary r_e, holding b_e constant. Lower
values of r_e simply had greater deviation about a regression line that
was the same for each condition. They found that subjects' judgments were
consistently more extreme than the criterion values; i.e., b_s was greater
than b_e. This was especially true when r_e was low. This result, along
with similar findings by Gray and by Naylor and Clark, was interpreted as
indicating that subjects are not conservative in this type of task.

Multiple-cue learning. Multiple-cue research has taken a great variety
of forms. Most of the studies rely upon the lens model for conceptual and
analytical guidance. Several have varied the number of cues, their r_{i,e}
values and the multiple correlation, r_e, the forms of the functional
relationships between cues and criterion, and the intercorrelation between
cues. Typically, the subject is presented with a set of cues, he makes a
quantitative judgment on the basis of these cues, and then receives the
criterion value as feedback. Among the major results are (a) subjects can
learn to use linear cues appropriately (Lee & Tucker, 1962; Smedslund,
1955; Summers, 1962; Uhl, 1963); (b) learning of nonlinear functions occurs
but is slower and less effective than learning of linear relationships
(Brehmer, in press; Hammond & Summers, 1965; Summers, 1967; Summers,
Summers, & Karkau, 1969) and is especially difficult if subjects are not
properly forewarned that the relations may be nonlinear (Earle, 1970;
Hammond & Summers, 1965; Summers & Hammond, 1966); (c) when relationships
are linear and r_e is held constant, subjects do better as cue
intercorrelations (redundancy) increase (Naylor & Schenck, 1968); (d)
subjects can learn to detect changes in relative cue weights over time,
although they do so slowly (Peterson, Hammond, & Summers, 1965a); (e) it is
easier for subjects to learn which cue to use than to discover which
functional rule relates a known valid cue to the criterion, and learning
both of these simultaneously is especially difficult (Summers, 1967); (f)
in a two-cue task, pairing a cue of low or medium validity with one of high
validity is detrimental to performance (a distraction effect), while
pairing a cue of low validity with another of medium or low validity is
facilitative (Dudycha & Naylor, 1966); and (g) subjects can learn to use
valid cues even when they are not reliably perceived (Brehmer, in press).

Conservatism has not been an explicit concern in many multiple-cue
learning studies. However, one study, by Peterson, Hammond, and Summers
(1965b), found that subjects failed to weight the most valid of three cues
heavily enough and slightly overweighted the cue with lowest validity.
Peterson et al. noted the similarity of these results to those of Bayesian
performance tasks.

A number of studies have investigated the effects of different modes of
feedback upon correlational learning. Outcome feedback works but is
relatively slow. Lens model feedback, indicating how a subject's cue
utilization coefficients compare with the ecological validities, is far
more effective (Newton, 1965; Todd & Hammond, 1965).

The lens model paradigm has also been extended to the problem of analyzing
interpersonal learning and conflict between pairs of individuals (Hammond,
1965; Hammond & Brehmer, in press; Hammond, Todd, Wilkins, & Mitchell, 1966;
Hammond, Wilkins, & Todd, 1966; Rappoport, 1965). A typical experiment
trains pairs of subjects to use one of two cues in either linear or
nonlinear fashion. Each subject learns to use a different cue, perhaps in
different ways as well. After training, subjects are brought together to
learn to predict a new criterion, using the same cues. Typically both cues
must be used in this second task, and the subjects' training leads them
initially to disagree with one another and with the outcome feedback they
receive from the task. Lens model analysis of each subject's individual
judgments and the pair's joint judgments provides a great deal of
information about the mechanisms whereby subjects learn from the task and
from one another. A study by Brehmer (1969b) found that the differences
between subjects' policies are rapidly reduced in the joint task, but this
reduction is accompanied by increased inconsistency such that overt
discrepancies are not very much diminished by the end of the conflict
period. Brehmer argued that it is necessary to invent methods to display
to the subjects the real sources of their disagreement. Another
interesting finding from this area is that persons initially trained to
have linear policies are less likely to change than are persons with more
complex, nonlinear policies (Brehmer, 1969b; Earle, 1970).

Non-metric stimuli, events, and responses. Bjorkman (1967) has applied
the lens model to the learning of non-metric stimuli that are predictive of
non-metric events. Bjorkman makes the distinction between "event learning"
(also known as probability learning), where the subject must learn to
predict by means of relative frequencies of different events, and
"stimulus-event learning," where stimuli function as cues for events.
Bjorkman (1969a) studied performance in a 2 x 2 task (two cues and two
events) and found a substantial degree of differential maximizing, a
strategy whereby the subject always predicts the most likely event, given
the particular cue. This strategy can be considered an optimal one in the
sense that it maximizes the number of correct predictions. A similarly high
level of optimality was observed by Summers (in press) in a 2 x 2 task and
by Beach (1964) and Howell and Funaro (1965) in more complex prediction
tasks. The latter study used scaled cue values. These results contrast with
the phenomenon of probability matching, whereby the subject matches his
response probabilities to the relative frequencies of the events.
Probability matching is commonly observed in simpler event-learning tasks.
The optimal strategy of predicting the most probable event every time is a
tedious chore in these simple event-learning tasks. In stimulus-event
learning, subjects can maximize without being completely repetitious, and
this may account for their increased optimality in these more complex
tasks. One other finding of interest in these kinds of tasks is that
subjects' responses become much more consistent as soon as feedback is
removed (Azuma & Cronbach, 1966; Bjorkman, 1963d; Bruner, Goodnow, & Austin,
1956). The presence of feedback apparently promotes hypothesis testing
wherein subjects attempt to outguess the random sequence of events.
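
The cost of probability matching relative to maximizing can be made
concrete. The following is a minimal sketch for a two-event task, assuming
that the subject's predictions are independent of the events on each trial;
the event probability of .7 and the function name expected_accuracy are
illustrative choices of ours, not values from the studies cited.

```python
def expected_accuracy(p_event, p_predict):
    """Expected proportion of correct predictions when the event occurs
    with probability p_event and the subject, independently of the event,
    predicts it with probability p_predict."""
    return p_event * p_predict + (1 - p_event) * (1 - p_predict)


p = 0.7  # assumed relative frequency of the more likely event
maximizing = expected_accuracy(p, 1.0)  # always predict the likelier event
matching = expected_accuracy(p, p)      # predict it 70% of the time
print(round(maximizing, 2), round(matching, 2))
```

With these numbers maximizing is correct on 70% of trials and matching on
58%, which is the sense in which maximizing is the optimal strategy.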

Bayesian Studies of Learning 

Bayesian researchers have been notably uninterested in the topic of
learning; they usually treat learning as a confounding to be avoided.
Many Bayesian studies have used situations like bookbags and poker chips
with which the experimenters assume the subject is already familiar. Others
(e.g., Lichtenstein & Feeney, 1968) have given initial training trials with
feedback. However, these training data are usually not analyzed.

The epitome of indifference to learning is illustrated in an article
by Peterson (1968). Peterson's subjects responded to more than 8000
four-data sequences, but the study does not mention whether feedback was
given (presumably it was not), and all analyses are based on all the data,
without any attention paid to changes over time. Peterson, like most other
Bayesian researchers, is interested in how subjects behave, not how they
learn.

A few studies do look at learning and merit attention. 

The effects of feedback. Edwards, Phillips, Hays, and Goodman (1968)
reported a study which compared two groups of subjects who gave likelihood
ratio responses; these responses were then cumulated, that is, converted
into posterior odds estimates, by the experimenters, using Bayes' theorem.
One group received feedback of the cumulated posterior odds after each
estimate; the other group received no feedback. This type of feedback was
found to degrade the cumulated posterior odds, making them more
conservative, although changes over time were not reported.
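
The cumulation step can be sketched as follows. This is a minimal
illustration of Bayes' theorem in odds form, assuming that the data are
conditionally independent given each hypothesis and that the prior odds are
even; the function name cumulate_odds and the example likelihood ratios are
hypothetical and are not the procedure or data of the study described.

```python
def cumulate_odds(likelihood_ratios, prior_odds=1.0):
    """Return the posterior odds after each datum: the prior odds times
    the running product of the stated likelihood ratios."""
    odds = prior_odds
    history = []
    for lr in likelihood_ratios:
        odds *= lr
        history.append(odds)
    return history


# Four hypothetical likelihood-ratio responses; the third datum
# favors the other hypothesis (ratio below 1).
print(cumulate_odds([2.0, 2.0, 0.5, 2.0]))
```

In this scheme the subject supplies only the individual likelihood ratios,
and the aggregation across data is carried out for him.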
Goldstein, Emanuel, and Howell (1968) varied diagnosticity, percent
of feedback, and specificity of feedback. They found that learning, as
evidenced by increased optimality of response over time, occurred only for
the high diagnosticity condition, not for the more difficult conditions.
The performance with more difficult data started at and stayed at the best
level finally obtained in the easy, i.e., highly diagnostic, condition.
They also found no differences in optimality between 100% feedback, 67%
feedback, 33% feedback, and no feedback. These peculiar results were
attributed to the unusual task employed: guessing whether a number drawn
from one normal distribution was larger than or smaller than a number drawn
(but not exposed) from another normal distribution (this is like asking "if
this female is 5' 8", will a randomly selected male be taller or
shorter?"). The distribution from which the exposed number was drawn
alternated on each trial. A simple but non-optimal strategy of using the
same cut-off point to determine one's response, regardless of which
population was sampled first, gave excellent results when the means of the
two populations were close together (low diagnosticity), but worked badly
when the means were farther apart (high diagnosticity). Thus the latter
group was forced to learn and adopt a more complicated strategy, using two
cut-off points, one for each distribution. As to the ineffectiveness of
feedback, the authors note that, since the subjects saw one number from
each distribution on every other trial, they apparently learned enough
about the situation even when they were not told whether their answer was
correct.

Martin and Gettys (1969) gave subjects either nominal feedback (H_1
generated the data) or probabilistic feedback (the posterior probabilities
that each hypothesis generated the data are .769 for H_1, .108 for H_2, and
.123 for H_3) in a multinomial task. These authors found that the
probability feedback produced better responses than nominal feedback, but
they found no evidence that learning had occurred, either across four
blocks of 50 trials, within the first 50 trials, or in a 20-trial
replication. Learning may have occurred in the five pre-experimental
practice trials.

The effects of payoff. Phillips and Edwards (1966) presented 20
sequences of binomial data to three groups, each with different payoff
schemes, and to one group which received no payoff. They found that the
no-payoff group showed a small amount of learning (decreasing discrepancy
from optimal responses); all payoff groups showed more learning, with no
evidence of asymptote by the end of the experiment. Performance showed
greater improvement in the later half of these 20-item sequences than in
the first half, suggesting that the subjects learned to use large
probabilities as the evidence for one hypothesis mounted.

Learning specific aspects of a probabilistic setting. Staël von Holstein
(1969) studied the ability to predict the price of 12 well-known stocks
two weeks in the future. His 72 subjects, who included bankers, stock market
experts, statisticians, business teachers, and business administration
students, made probability estimates for each stock across five hypotheses
(decrease more than 3%, decrease 1 to 3%, change less than 1%, increase 1
to 3%, and increase more than 3%) for 10 consecutive two-week periods. At
the end of each period, the subjects received outcome feedback. The task
proved to be exceptionally difficult. Only two of the 72 subjects performed
as well as a hypothetical, totally ignorant subject who always assigned a
probability of .2 to every hypothesis. The subjects would have done better
by acknowledging their own inabilities in stock-market prediction, thus
giving more diffuse estimates. The subjects did not learn to improve their
performance over the ten periods, but they did apparently learn, to some
extent, how poor they were: the spread of their probability estimates
across the hypotheses increased over time.
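
Why a diffuse forecast can outperform a confident one is easy to show
with a proper scoring rule. The sketch below assumes a quadratic (Brier)
penalty, which is our choice for illustration; the passage above does not
say which scoring rule was used in the study. The overconfident forecast
and the chance-level hit rate are likewise invented.

```python
def quadratic_score(probs, outcome):
    """Quadratic (Brier) penalty for one forecast over mutually exclusive
    hypotheses; `outcome` is the index of the true one. Lower is better."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


uniform = [0.2] * 5                    # the "totally ignorant" subject
sharp = [0.8, 0.05, 0.05, 0.05, 0.05]  # invented overconfident forecast

uniform_penalty = quadratic_score(uniform, 0)  # same for every outcome
sharp_hit = quadratic_score(sharp, 0)          # favored hypothesis true
sharp_miss = quadratic_score(sharp, 1)         # favored hypothesis false

hit_rate = 0.2  # assumption: the favored hypothesis is right at chance
sharp_expected = hit_rate * sharp_hit + (1 - hit_rate) * sharp_miss
print(round(uniform_penalty, 2), round(sharp_expected, 2))
```

Under these assumptions the uniform forecast incurs a penalty of .80 on
every trial, while the confident forecaster who is right only at chance
level averages 1.25, so spreading probability more evenly is the better
policy for a judge with no predictive ability.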

Schum (1966a) showed that subjects can learn and utilize existing
conditional non-independence in multinomial data. The subjects were warned
which data sources might be non-independent, but they were not told the
form of the relationship, nor which of the hypotheses mediated the
relationship. They were taught to tabulate the frequencies with which the
data occurred in such a way that the non-independence could be seen. Thus
the outstanding achievement of the subjects was not that they could learn
what interdependencies existed, but that they could utilize this
information appropriately in their posterior probability estimates: their
responses more closely matched a model utilizing the non-independence than
a model in which independence was falsely assumed.

Two additional learning studies were oriented to the previously
discussed misperception explanation of conservatism. To strengthen the
point that subjects' conservatism resulted from their misunderstanding
of the data generator, Peterson, DuCharme, and Edwards (1968) showed that
subjects were less conservative after they had seen 100 illustrative
samples of data from the binomial data generator. Wheeler and Beach (1968)
amplified this finding. They not only showed their subjects 200 binomial
samples, but they asked the subjects to make a bet on which population
generated the data, for each sample. The effects of such training were seen
in increased accuracy of subjects' estimated sampling distributions and
decreased conservatism.



Descriptive Strategies: What Is the Judge Really Doing?

Thus far we have tied our presentation of theoretical notions and
empirical results rather closely to the Bayesian and regression paradigms.
In doing so, we have accepted the validity of their models rather
uncritically as descriptive indicators of cognitive processes. In this
section we shall examine a few studies that point to the deficiencies of
these models with regard to providing insight into cognition. These studies
aim to uncover specific strategies or cognitive rules that subjects employ
in order to produce the judgments demanded of them.

Forewarning of the descriptive problems of models was provided by
Hoffman (1960), whose paper not only provided much of the impetus for
correlational research but also presented a cogent discussion of the
distinction between simulating behavior and actually capturing the ongoing
psychological processes. Nevertheless, with but a few exceptions, the
ensuing research has not been oriented towards uncovering strategies.
Instead most research has implicitly assumed that the various regression
and Bayesian models that summarize the data so well actually describe
cognitive processes.

Strategies in Correlational Research 

Starting-point and adjustment strategies. The present authors have
recently conducted several experiments that seem to provide insight into the
cognitive operations performed by decision makers as they attempt to
integrate information into an evaluative judgment. In a study by Slovic and
Lichtenstein (1968), the stimuli were gambles, described by four risk
dimensions: probability of winning (P_W), amount to win ($_W), probability
of losing (P_L), and amount to lose ($_L).

One group of subjects was asked to indicate their strength of preference
for playing each bet on a bipolar rating scale. Subjects in a second group
indicated their opinion about a gamble's attractiveness by equating it with
an amount of money such that they would be indifferent between playing the
gamble and receiving the stated amount. This type of response is referred
to as a "bid."

The primary data analysis consisted of correlating each subject's
responses with each of the risk dimensions across a set of gambles. These
correlations indicated that the subjects did not weight the risk dimensions
in the same manner when bidding, i.e., when evaluating a gamble in monetary
units, as when rating it. Ratings correlated most highly with P_W, while
bids were influenced most by $_W and $_L.

Both bids and ratings presumably reflect the same underlying character- 
istic of a bet — namely, its worth or attractiveness. Why should subjects 
employ probabilities and payoffs differently when making these related 
responses? The introspections of one individual in the bidding group are 
especially helpful in providing insight into the type of cognitive process 
that could lead bidding responses to be overwhelmingly determined by just 
one payoff factor. This subject said, 

"If the odds were . . . heavier in favor of winning . . . rather than
losing . . . , I would pay about 3/4 of the amount I would expect to win. If
the reverse were true, I would ask the experimenter to pay me about . . .
1/2 of the amount I could lose."

Note this subject’s initial dependence on probabilities followed by a 
complete disregard for any factor other than the winning payoff for attractive 
bets or the losing payoff for unattractive bets. After deciding he liked a 
bet, he used the amount to win, the upper limit of the amount he could bid, as 
a starting point for his response. He then reduced this amount by a fixed 
proportion in an attempt to integrate the other dimensions into the response. 
Likewise, for unattractive bets, he used the amount to lose as a starting
point and adjusted it proportionally in an attempt to use the information
given by the other risk dimensions. Such adjustments, neglecting to consider
the exact levels of the other dimensions, would make the final response
correlate primarily with the starting point, which in this case is one of
the payoffs.
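
The rule this subject described can be written out as a short sketch.
The fractions 3/4 and 1/2 come from the quoted introspection; everything
else is our assumption for illustration, including the function name bid,
the sign convention (a negative bid means the subject asks to be paid), and
the treatment of equal odds as an unattractive bet.

```python
def bid(p_win, amount_win, p_lose, amount_lose):
    """Starting-point-and-adjustment bid, after one subject's report.

    Probabilities are used only to classify the bet; the bid itself is a
    fixed fraction of a single payoff. A negative value means the subject
    would ask to be paid that amount to play.
    """
    if p_win > p_lose:              # attractive bet: start from amount to win
        return 0.75 * amount_win
    return -0.5 * amount_lose       # otherwise: start from amount to lose


# Two attractive bets that differ only in the amount to lose get the
# same bid, so bids track the amount to win and ignore the other payoff.
print(bid(0.8, 4.0, 0.2, 1.0), bid(0.8, 4.0, 0.2, 3.0))
print(bid(0.2, 4.0, 0.8, 2.0))
```

Once a bet has been classified, the exact probabilities and the other
payoff no longer affect the response, which is why responses generated
this way would correlate primarily with the starting payoff.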

It is interesting to note that this starting point and adjustment
process is quite similar to the fixed-percent markup rule that businessmen
often use when setting prices (Katona, 1951). This type of process can be
viewed as a cognitive shortcut employed to reduce the strain of mentally
weighting and averaging several dimensions at once.

The observation of simple starting point and adjustment procedures in
bidding and pricing judgments has led the first author to conduct an
extensive and still unfinished study to uncover strategies by which
subjects average just two numerical cues into an evaluative judgment.
Preliminary analysis of the data indicates that, even in this relatively
simple task, subjects tend to use a single cue as a starting point for
their judgment. Next, they adjust this starting judgment rather imprecisely
in an attempt to take the other cue into account. These data suggest that
the subjects, although college students of above average intelligence,
resorted to simple strategies in order to combine the two cue values. They
were not skilled arithmeticians, able to apply regression equations or
produce weighted averages without computational aids.

Strategies in multiple-cue learning. Close examination of multiple-cue
learning studies provides further evidence for simple strategies. For
example, Azuma and Cronbach (1966) studied the ability of subjects to learn
to predict a criterion value on the basis of several cues. When subjects'
responses were correlated with the cue values over blocks of trials, the
results indicated an orderly progression towards proper weighting of the
cues. However, when successful learners were asked to give introspective
accounts of the process by which they made their judgments, these reports
bore little resemblance to the weighting function employed by the
experimenters. Instead they typically described a sequence of rather
straightforward mechanical operations. Azuma and Cronbach observed that,
although the experimenter regards the universe of stimuli as an
undifferentiated whole, their subjects isolated sub-universes and employed
different rules within each of these. If correlations are to be used, they
argued, they should be calculated separately for each sub-universe of
stimuli.

Strategies for Estimating P(H/D) 

The modern theory of probability was conceived during the 17th Century
when an aristocratic Frenchman, the Chevalier de Méré, realized that reason
could be substituted for painful experience in determining one's chances at
the gambling tables. Since that time, students of the theory have been
continually amazed at its subtlety and the extent to which answers derived
from it conflict with their intuitive expectations. Nevertheless, a recent
review by Peterson and Beach (1967) concerning man's capabilities as an
"intuitive statistician" came to an optimistic conclusion. Peterson and
Beach asserted that:

"Experiments that have compared human inferences with those
of statistical man show that the normative model provides
a good first approximation for a psychological theory of
inference. Inferences made by subjects are influenced by
appropriate variables and in appropriate directions" [Pp. 42-43].

Even the spectre of conservatism has failed to dampen the optimism of 
some researchers. Beach (1966) and others attributed conservatism to erron- 
eous subjective probabilities rather than an inadequate Bayesian processing 
of this information. 

Our own examination of the experimental literature suggests that the
Peterson and Beach view of man's capabilities as an intuitive statistician is
too generous. Instead, the intuitive statistician appears to be quite confused
by the conceptual demands of probabilistic inference tasks. He seems
capable of little more than revising his response in the right direction upon
receipt of a new item of information (and the inertia effect is evidence
that he is not always successful even in this). After that, the success he
obtains may be purely a matter of coincidence -- a fortuitous interaction
between the optimal strategy and whatever simple rule he arrives at in his
groping attempts to ease cognitive strain and to pull a number "out of the
air."

Constant Δp strategy. There are several simple strategies that seem to
us to highlight subjects' difficulties in conceptualizing the requirements of
probabilistic inference tasks and, at the same time, explain many of the
ethereal phenomena that comprise the "conservatism" effect. The first such
strategy is to revise one's P(H/D) response by a constant, Δp, regardless of
the prior probability of the hypothesis or the diagnosticity of the data.
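
The contrast between a constant Δp rule and Bayesian revision can be made concrete with a small sketch. It assumes a symmetric two-hypothesis binomial task (one bag holds a proportion p of red chips, the other 1 - p); the function names and the step size of .05 are illustrative assumptions, not values taken from any of the studies cited here.

```python
def bayes_revision(prior, p_datum_given_h1, p_datum_given_h2):
    """Posterior P(H1/D) from Bayes' theorem for two exhaustive hypotheses."""
    numerator = prior * p_datum_given_h1
    return numerator / (numerator + (1.0 - prior) * p_datum_given_h2)


def constant_revision(prior, confirms_h1, delta_p=0.05):
    """Hypothetical constant-step rule: move the response by a fixed amount.

    The step ignores both the prior and the diagnosticity of the datum;
    delta_p = 0.05 is an arbitrary illustrative value.
    """
    step = delta_p if confirms_h1 else -delta_p
    return min(1.0, max(0.0, prior + step))


def bayes_change(prior, p):
    """Size of the Bayesian revision after one datum that favors H1."""
    return bayes_revision(prior, p, 1.0 - p) - prior


# The Bayesian change varies with prior and diagnosticity; the constant
# rule produces the same change in every cell of the table.
for p in (0.6, 0.7, 0.8):
    for prior in (0.5, 0.7, 0.9):
        constant_change = constant_revision(prior, True) - prior
        print(f"p={p:.1f} prior={prior:.1f} "
              f"Bayes change={bayes_change(prior, p):.3f} "
              f"constant change={constant_change:.3f}")
```

Under these assumptions the Bayesian change shrinks as the prior approaches 1 and grows with diagnosticity, whereas the constant rule is insensitive to both, which is the pattern of insensitivity described in this section.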

The strongest evidence for this strategy comes from Pitz, Downing, and
Reinhold (1967). Subjects saw sequences of either 5, 10, or 20 data items
and made a probability revision after each datum. Three different levels of
data diagnosticity were employed, using a binomial task. The results
indicated the usual inverse relationship between diagnosticity and conservatism,
with some subjects overreacting to data of low validity. Longer
sequences produced greater conservatism. Pitz et al. noted that events which
confirmed the favored hypothesis resulted in approximately equal changes
in subjective probability, regardless of a subject's prior probability.
There was little difference between changes for sequences of lengths 5 and 10,
but the average change for sequences of length 20 was considerably lower,
as if subjects were holding back in anticipation of a greater amount of
future information. The experimenters also reported the "remarkable fact"
that the average change was not a function of the nature of the two
hypotheses but, instead, was approximately the same across the three levels
of diagnosticity. They concluded with the observation that:

"The fact that changes in subjective probability were a
constant function of prior probabilities, were independent
of the nature of the hypotheses, yet were not
independent of the length of the sequence of data, implies
that a subject's performance in a probability revision
task is nonoptimal in a more fundamental way than is
implied by discussions of conservatism. Performance is
determined in large part by task characteristics which
are irrelevant to the normative model. . . . It may not
be unreasonable to assume that . . . the probability
estimation task is too unfamiliar and complex to be
meaningful" (Pitz, Downing, & Reinhold, 1967, p. 392).

This same sort of insensitivity to gross variations in sample diagnosticity
is evident in studies by Martin (1969), Peterson and Miller (1965), Peterson,
Schneider, and Miller (1965), and Schum and Martin (1968) and serves to
explain many of the effects observed there.

Similarity strategies. The second type of strategy for making probability
estimates appears in several studies. The subjects base their
responses on the similarity of the data with whatever striking aspect of
the situation the experimenter has provided. This strategy was observed
by Dale (1968) in a pseudo-military task involving four hypotheses. The
values of P(D/H) were displayed as histograms. As the subjects received
the ten data reports, they often physically arranged the data reports to
form a histogram which they then compared with the conditional probability
display. The relative magnitudes of their responses appeared to be based
upon the similarity between the pattern formed by the data and the pattern
formed by each of the conditional distributions. Dale notes that the
subjects were at a loss to know what magnitude of probability to assign a
given level of similarity. One subject, when he had assessed the probability
of the correct hypothesis at .38 (the Bayesian probability was .98),
remarked: "Getting mighty high!"

Lichtenstein and Feeney (1968) also observed a kind of similarity
strategy. Their subjects were shown the locations of bomb blasts and had
to estimate the probability that the intended target was City A or City B.
The subjects were told that the errors were unbiased in that a bomb was
just as likely to miss its target in any direction. They were also told
that a bomb was more likely to fall near its target than far from it. The
subjects' responses were clearly discrepant from the optimal responses
derived from the circular normal data generator. Several subjects reported
that they compared the distances of the bomb site from the two cities and
based their estimates on this comparison, that is, on the similarity between
the location of the datum and the locations of the cities. A model assuming
that probability estimates were simply a function of the ratio of the two
distances did a much better job of predicting the responses of most subjects
than did the "correct" circular normal model.

The binomial task provides the subject with the least amount of explicit
information against which the subject can compare the sample. In such tasks,
several independent studies have shown that the subjects make their responses
match the sample. For example, Beach, Wise, and Barclay (1970), using a
binomial task with a simultaneous sample of n items, found a remarkably
close relationship between the sample proportion and the posterior probability
estimates. Several of their subjects remarked that sample proportions
are very compelling because they are available (and somehow relevant)
numbers in a very difficult and foreign task. Studies by Kriz (1967) and
Shanteau (1969) have reported similar use of sample proportions as the
basis for P(H/D) estimates. This simplifying strategy does not take
into account the likelihood of the data, as specified by the population
proportions. Subjects thus would not change their responses
across tasks that vary in population proportion (diagnosticity);
this lack of sensitivity has been reported by Beach, Wise, and Barclay
(1970) and by Vlek (1965), who suggested that ". . . subjects do not look
further than the sample presented to them" (p. 22).

For the usual levels of diagnosticity found in binomial tasks, the 
strategy of using the sample proportion to estimate P(H/D) will produce very 
conservative performance. Beach et al. (1970) concluded that this strategy 
is a spurious one that invalidates the bookbag and poker chip task as an 
indicant of subjective probability revision. It seems to us that this may 
be too harsh a judgment in light of the ubiquity of simple strategies for 
inference across a variety of laboratory and real-life judgment situations. 
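
Why the sample-proportion strategy yields conservative responses can be seen in a short sketch. It assumes a symmetric binomial task with equal priors, in which one hypothesis specifies a proportion p of red chips and the other 1 - p; the sample of 7 reds in 10 draws and the values of p are illustrative assumptions, not data from the studies cited.

```python
def bayes_posterior(reds, n, p, prior=0.5):
    """P(H1/D) when H1 says P(red) = p and H2 says P(red) = 1 - p.

    The binomial coefficients cancel in the likelihood ratio, leaving
    (p / (1 - p)) raised to the difference (reds - non-reds).
    """
    difference = reds - (n - reds)
    likelihood_ratio = (p / (1.0 - p)) ** difference
    posterior_odds = (prior / (1.0 - prior)) * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def sample_proportion_response(reds, n):
    """The simple strategy: report the sample proportion as P(H1/D)."""
    return reds / n


reds, n = 7, 10  # illustrative sample
for p in (0.6, 0.7, 0.8):
    print(f"p={p:.1f}  Bayes={bayes_posterior(reds, n, p):.3f}  "
          f"sample proportion={sample_proportion_response(reds, n):.3f}")
```

In this sketch the strategy's response stays at .70 at every level of diagnosticity, while the Bayesian posterior for the same sample rises with p (for example, about .97 when p = .7), so the simple strategy looks conservative and insensitive to diagnosticity at the same time.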

Aiding the Decision Maker 

Experimental work such as we have just described documents man's
difficulties in processing multidimensional and probabilistic information.
Unfortunately, there is abundant evidence indicating that these difficulties
persist when the subject leaves the artificial confines of the laboratory
and resumes the task of using familiar sources of information to make
decisions that are important to himself and others. Examples of improper and
overly simplistic use of information have been found in business decision
making (Katona, 1951), military decision making (Wohlstetter, 1962),
governmental policy (Lindblom, 1964), design of scientific experiments
(Tversky & Kahneman, in press), and management of our natural resources
(Kates, 1962; Russell, 1969; White, 1966). Agnew and Pyke (1969, p. 39)
note that a decision maker left to his own devices

". . . uses, out of desperation, or habit, or boredom, or exhaustion,
whatever decision aids he can -- anything that prepackages information."

Among the vast assortment of decision aids described by Agnew and Pyke are
rumors, cultural biases and self-evident truths, common sense, appeal to
authority, and appeal to experts who, themselves, are all too fallible.

The need for effective decision aids has not gone unnoticed, however.
This is an age of technological advancement that creates more difficult and
more important decision problems as it provides man with ever more power to
manipulate his environment. It is not surprising, therefore, that this same
technological bent has been focused upon the decision-making process itself.

The aim of this section is to describe two recent and distinctive contributions
of the regression and Bayesian approaches to the improvement of
decision making.

Probabilistic Information Processing Systems

A great deal of Bayesian research has centered about some new ideas
for putting probability assessments to use in diagnostic systems. Edwards
(1962) introduced the notion of a probabilistic information processing (PIP)
system because of his concern about optimal use of information in military
and business settings. He distinguished two types of probabilistic outputs
for such a system. The first was diagnosis (what is the probability that
this activity indicates an enemy attack?) and the second was parameter
estimation (how rapidly is that convoy moving and in what direction?).
His proposal was simple. Let men estimate P(D/H), the probability that a
particular datum would be observed given a specified hypothesis, and let
machines integrate these P(D/H) estimates across data and across hypotheses
by means of Bayes' theorem. After all the relevant data have been processed,
the resulting output is a posterior probability, P(H/D), for each hypothesis.
Edwards originally designed the PIP system with the intention of using
Bayes' theorem as a labor-saving device. However, research subsequently
indicated that difficulties in aggregating data led subjects' unaided posterior
probability estimates to be markedly conservative. The need to develop an
antidote for conservatism thus added considerable impetus to the development of
PIP systems.

Edwards and Phillips (1964) promoted the PIP system as a promising 
alternative to traditional command and control systems. They hypothesized 
that PIP would produce faster and more accurate diagnoses for several reasons. 
First, Bayes' theorem is an optimal procedure for extracting all the certainty
available in data. It automatically screens information for relevance, 
filters noise, and weights each item appropriately. In addition, PIP 
systems promise to permit men and machines to complement one another, using 
the talents of each to best advantage. 
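
The division of labor in a PIP system, with men supplying P(D/H) judgments and a machine aggregating them by Bayes' theorem, can be sketched as follows. This is only an illustration: it assumes the data are conditionally independent given each hypothesis, and the function name, the two hypotheses, and all of the judged values are hypothetical.

```python
def pip_posteriors(priors, judged_likelihoods):
    """Aggregate judged P(D/H) values into posterior probabilities.

    priors: one prior probability per hypothesis (summing to 1).
    judged_likelihoods: one list per datum, each giving the judged
        P(D/H) for every hypothesis, in the same order as priors.
    Assumes the data are conditionally independent given a hypothesis.
    """
    posteriors = list(priors)
    for likelihoods in judged_likelihoods:
        unnormalized = [p * l for p, l in zip(posteriors, likelihoods)]
        total = sum(unnormalized)
        posteriors = [u / total for u in unnormalized]
    return posteriors


# Hypothetical judgments for two hypotheses ("attack", "peace")
# across three data items; none of these numbers come from a study.
priors = [0.5, 0.5]
judged = [
    [0.60, 0.30],  # datum 1: twice as likely under "attack"
    [0.50, 0.25],  # datum 2: twice as likely under "attack"
    [0.20, 0.40],  # datum 3: twice as likely under "peace"
]
print(pip_posteriors(priors, judged))
```

The judge never has to cumulate evidence across data in this arrangement; each datum is evaluated on its own, and only the machine carries the running posterior.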

Sometimes P(D/H) values are readily calculable from historical information
or from some explicit model of the data-generating device. However, in many
cases, no such probabilities exist. Edwards and Phillips observed, for
example, that calculation is inadequate to assess the probability that
Russia would have launched 25 reconnaissance satellites in the last three
days if she planned a missile attack on the United States. Only human
judgment can evaluate this type of information; PIP systems obtain and use 
such judgments systematically. 

Given the basic idea of a PIP, much experimental research was needed
before it could be implemented effectively. Edwards and Phillips discussed
the need to verify the basic premise that men can be taught to be good
estimators for probabilities. One question concerned the most effective
method for making such estimates. For example, should men estimate P(D/H)
values directly or estimate other quantities from which P(D/H) can be
inferred? Subsequent research indicated that it is easier to estimate
likelihood ratios than to estimate P(D/H) values themselves, because the
latter are influenced by many irrelevant factors such as the level of detail
with which the datum is specified (Edwards, Lindman, & Phillips, 1965).

Perhaps the most important research need was to evaluate the effectiveness
of PIP systems in realistically complex environments. A number of such
studies have been completed in recent years. One of the most extensive and
carefully done studies was by Edwards, Phillips, Hays, and Goodman (1968).
They constructed an artificial, future world (complete with "history" up to
1975) and wrote 18 scenarios, each with 60 data items. The subjects related
the data to six hypotheses concerning war within the next 30 days; e.g., one
hypothesis was "Russia and China are about to attack North America," while
another was "Peace will continue to prevail." Four groups of subjects received
intensive training in the characteristics of the "world," and then each group
was trained in a particular response task. All subjects then responded to the 18
scenarios. The PIP group's responses were likelihood ratios. For each datum,
five ratios were given, comparing in turn the likelihood of the datum given
each of the war hypotheses against the likelihood of the datum given the
peace hypothesis. The responses were registered on log-odds scales.

The "POP" group responded with posterior odds, estimated upon receipt 
of each new datum. Again, each of the war hypotheses was compared in turn 
to the peace hypothesis. 

The "PEP" group responded by naming, for each war hypothesis, the fair
price for an insurance policy that would pay 100 points in the event of
that particular war, and nothing in the event of peace.

The "PUP" group gave probability estimates comparable to the PEP
group's price estimates.

Thus, of the four groups, only the PIP group, who gave likelihood
ratios, were relieved of the task of cumulating evidence across the 60
data in each scenario. In PIP, this aggregation was done by machine to
compute final odds.

No optimal model can be devised for this simulation. The "true"
hypothesis for any scenario was not known. Results showed, however, that
the PIP group arrived at larger final odds than other groups. When PIP
showed final odds of 99:1, other groups showed final odds from 2:1 to 12:1.
Because of this greater efficiency, the authors concluded that PIP was
superior to the other systems.

The problem of finding a task complex enough to warrant the comparison
of P(D/H) responses (PIP) with P(H/D) responses (POP), while still providing
an optimal model against which to evaluate both methods, was solved ingeniously
by Phillips (1966, also reported in Edwards, 1966). The
data were thirty bigrams, combinations of two letters such as "th" or "ed."
The hypotheses were that the bigrams were drawn either from the first
two letters of words, or from the last two letters of words. The bigram
"ed" might thus be viewed as beginning a word (like editor) or ending a
word (like looked). Phillips' subjects were six university newspaper editorial
writers; data came from their own editorials. Frequency counts using the
subjects' editorials (not shown to them) provided the veridical probabilities
against which their responses could be compared. For the PIP task, all
subjects estimated the likelihood ratio, P(D/H1) / P(D/H2), for each bigram.
Then, for the POP task, they were asked to imagine that the bigrams had been
placed in two bookbags according to their frequencies of use, i.e., if "my"
had occurred 20 times at the beginning of words and 40 times at the end of
words, the 20 "my" bigrams were placed in bag B, and 40 in bag E. One bag
was chosen by the flip of a coin, and 10 bigrams were successively sampled.
The subjects gave posterior odds estimates after each draw. Following this
POP task, they repeated the PIP task.

Results showed that in the PIP task subjects were modestly successful
at estimating the relative frequencies of their own use of bigrams, but five
of the six subjects were conservative. In the POP task they were much more
conservative; they treated all but two of the bigrams as if they provided
little or no diagnostic information.
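
The veridical values against which such responses are scored follow directly from the frequency counts, as the sketch below illustrates. Only the "my" counts (20 and 40) come from the example in the text; the other bigram counts, the function names, and the assumption that draws are made with replacement (or from bags large enough that it does not matter) are hypothetical.

```python
from math import prod

# Counts of each bigram at the beginnings (bag B) and ends (bag E) of words.
# Only the "my" counts come from the text; the rest are invented.
COUNTS = {
    "my": (20, 40),
    "th": (90, 10),
    "ed": (15, 85),
    "re": (75, 65),
}
TOTAL_B = sum(begin for begin, _ in COUNTS.values())
TOTAL_E = sum(end for _, end in COUNTS.values())


def likelihood_ratio(bigram):
    """Veridical P(D/B) / P(D/E) for one bigram, from the counts."""
    begin, end = COUNTS[bigram]
    return (begin / TOTAL_B) / (end / TOTAL_E)


def posterior_odds(draws, prior_odds=1.0):
    """Odds favoring bag B after a sequence of draws.

    prior_odds = 1.0 reflects the coin flip that selects the bag.
    Assumes independent draws (sampling with replacement).
    """
    return prior_odds * prod(likelihood_ratio(b) for b in draws)


print(likelihood_ratio("my"))              # the PIP quantity for one bigram
print(posterior_odds(["th", "my", "th"]))  # the POP quantity after three draws
```

A subject's likelihood-ratio estimates (PIP) would be compared with `likelihood_ratio`, and his odds estimates (POP) with `posterior_odds`, so both tasks are scored against the same counts.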

Kaplan and Newman (1966) reported the results of three experiments 
designed to evaluate PIP in a military setting. In two out of three studies 
the PIP technique showed a definite superiority over a POP condition. This 
superiority was particularly evident early in the data sequence. The 
authors speculated that the relatively poor performance of the PIP system 
in the third experiment may have been due to the fact that subjects there 
were provided with the output of Bayes' theorem after each datum was presented,
making it difficult to evaluate each item of information on its own
merit. Edwards, Phillips, Hays, and Goodman (1968) and Schum, Southard, and
Wombolt (1969) also found a detrimental effect from showing P(D/H) estimators
the current state of the system. 

A major effort to evaluate the idea of a PIP system within the context
of threat evaluation has been carried out at Ohio State University under
the direction of David Schum and his colleagues. The results are described
in Briggs and Schum (1965), Howell (1967), Schum (1967, 1968, 1969),
Schum, Goldstein, and Southard (1966), and Schum, Southard, and Wombolt
(1969). Unlike the PIP simulations of Edwards and Kaplan and Newman, the
Ohio State research employed a frequentistic environment where the experimenters
specified a P(D/H) matrix that governed the sampling from a limited
set of data to form a number of scenarios. Subjects had to learn the
import of various data items by accumulating relative frequencies linking
data and hypotheses. The subjects were intensively trained in making
probabilistic judgments and were quite familiar with the characteristics
of the information with which they were dealing. Howell (1967) has summarized
the first six years of research at Ohio State, concluding that automation of
the aggregation process (i.e., PIP) can be expected to improve the quality
of decisions in a wide variety of diagnostic conditions. He also observed
that the superiority of a PIP system is most pronounced under degraded,
stressful, or otherwise difficult task conditions.

In contrived or simulated diagnostic situations, the PIP system seems
to be a promising device for producing posterior probabilities. Recent
endeavors have attempted to step up the complexity of the simulations in
an attempt to narrow the gap between them and real-world diagnostic systems.
Schum (1969) discusses some of the problems that must be solved as more
realistic complexity is introduced into the system. For example, in
systems that periodically experience high rates of data accumulation,
experts who assess P(D/H) may have to aggregate their judgments over a
series of data (i.e., judge P(D1, D2, D3, ..., Dn/H)). When data items are
nonindependent, these conditional probabilities can become quite complex.
Three experiments reported by Schum, Southard, and Wombolt (1969) found
that men could adequately aggregate diagnostic import across small samples
of such nonindependent data. In addition, PIP was increasingly superior to
POP when scenarios were either large or highly diagnostic or both. There is
no longer any doubt that PIP is a viable concept for the design of decision
systems. Future work will most likely see the extension of PIP to nonmilitary
settings along with greater attention to the practical details of
implementing such systems in the real world. PIP systems have already been
proposed for medicine (Lusted, 1968; Gustafson, 1969; Gustafson, Edwards,
Phillips, & Slack, 1969), and probation decision making (McEachern & Newman,
1969), and applications to weather forecasting, law, and business seem imminent.

Bootstrapping

Can a system be designed to aid the decision maker that takes as input
his own judgments of complex stimuli? One possibility is based on the
finding that regression models, such as the linear model, can do a remarkably
good job of simulating such judgments. An intriguing hypothesis about
cooperative interaction between man and machine is that these simulated
judgments may be better, in the sense of predicting some criterion or
implementing the judge's personal values, than were the actual judgments
themselves. Dawes (1970) has termed this phenomenon "bootstrapping."

The rationale behind the bootstrapping hypothesis is quite simple.
Although the human judge possesses his full share of human learning and
hypothesis generating skills, he lacks the reliability of a machine. As
Goldberg (1970, p. 423) puts it,

"He 'has his days': Boredom, fatigue, illness,
situational and interpersonal distractions all
plague him, with the result that his repeated
judgments of the exact same stimulus configuration
are not identical. He is subject to all these
human frailties which lower the reliability of
his judgments below unity. And, if the judge's
reliability is less than unity, there must be error
in his judgments -- error which can serve no other
purpose than to attenuate his accuracy. If we
could . . . [eliminate] the random error in his
judgments, we should thereby increase the validity
of the resulting predictions."

Of course, the bootstrapping procedure, by foregoing the usual process 
of criterion validation, is vulnerable to any misconceptions or biases 
that the judge may have. Implicit in the use of bootstrapping is the assumption
that these biases will be less detrimental to performance than the
inconsistency of unaided human judgment. 

Bootstrapping seems to have been explored independently by at least
four groups of investigators. Yntema and Torgerson (1961) reported a study
that suggested its feasibility. Their subjects were taught, via outcome
feedback, to predict a rather nonlinear criterion. After 12 days of practice,
they were given a set of test trials and their average correlation with
the criterion was found to be .84. Then a linear regression model was
computed for each subject on the basis of his responses during the final
practice day. When these models were used to predict the criterion, the
average correlation rose to .89. Thus consistent application of the linear
model improved the predictions even though the subjects had presumably
been taking account of non-linearities in making their own judgments.
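
The logic of this result can be illustrated with a small simulation. The sketch below is not a reconstruction of the Yntema and Torgerson task: the criterion function, the size of the judge's random error, the sample sizes, and all names are assumptions chosen only to show how a linear model fitted to a judge's own responses can outpredict the judge.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def criterion(x1, x2):
    """Hypothetical, mildly nonlinear criterion (an assumption)."""
    return 2.0 * x1 + x2 + 0.5 * x1 * x2


def simulate(n, judge_noise_sd=2.0):
    """Cues, true criterion values, and an unreliable judge's responses.

    The simulated judge knows the criterion exactly, nonlinearity
    included, but adds random error to every judgment.
    """
    cues = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    truth = [criterion(x1, x2) for x1, x2 in cues]
    judged = [t + rng.gauss(0, judge_noise_sd) for t in truth]
    return cues, truth, judged


def fit_linear(rows, targets):
    """Ordinary least squares (with intercept) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    aug = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           + [sum(d[i] * t for d, t in zip(design, targets))]
           for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def pearson(a, b):
    """Pearson product-moment correlation between two sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Fit the model to the judge's responses on practice trials only,
# then compare judge and model against the criterion on new trials.
train_cues, _, train_judged = simulate(500)
test_cues, test_truth, test_judged = simulate(500)
weights = fit_linear(train_cues, train_judged)
model_pred = [weights[0] + weights[1] * x1 + weights[2] * x2
              for x1, x2 in test_cues]
judge_r = pearson(test_judged, test_truth)
model_r = pearson(model_pred, test_truth)
print(f"judge vs. criterion: r = {judge_r:.2f}")
print(f"model of judge vs. criterion: r = {model_r:.2f}")
```

Under these assumptions the model never sees the criterion, only the judge's responses, yet it correlates more highly with the criterion than the judge does, because the consistency it gains outweighs the nonlinear component it ignores.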

Yntema and Torgerson saw in these results the possibility that artificial,
precomputed judgments may in some cases be better than those the man could 
make himself if he dealt with each situation as it arose. More recently, 
Dudycha and Naylor (1966) have reached a similar conclusion on the basis of 
their observation that subjects in a multiple-cue learning task were employing 
the cues with appropriate relative weights but were being inaccurate due 
to the inconsistency of their judgments. They concluded that, although humans 
may be used to generate strategies, they should then be removed from the 
system and replaced by such strategies. 

Bowman (1963) outlined a bootstrapping approach within the context of
managerial decision making that has stimulated considerable empirical
research (see Gordon, 1966; Hurst and McNamara, 1967; Jones, 1967; and
Kunreuther, 1969). Kunreuther, for example, developed a linear model of
production scheduling decisions in an electronics firm. Coefficients were
estimated to represent the relative importance of sales and inventory
variables across a set of decisions made by the production manager. Under
certain conditions, substitution of the model for the manager was seen to
produce decisions superior to those the manager made on his own.

At about the time that Bowman was proposing his version of bootstrapping,
Ward and Davis (1963) were advocating the same kind of approach to man-computer
cooperation. Although they presented no data, Ward and Davis outlined several
applications of the method in tasks such as estimating the time it would take
to retrain 500 people, who now hold 500 existing jobs, to 500 new, possibly
different jobs. Here a model would be built to capture an expert judge's
policy on the basis of a relatively small number of cases. The model could
then be substituted for the expert on the remaining cases out of the possible
set of 250,000. Ward and Davis also outlined an application of bootstrapping
for the purpose of assigning personnel to jobs so as to maximize the payoff
of the assignments.

Goldberg (1970) evaluated the merits of bootstrapping in a task where
29 clinical psychologists had to predict the psychiatric diagnoses of 861 
patients on the basis of their MMPI profiles. A linear model was built 
to capture the weighting policy of each clinician. When models of each 
clinician were constructed on the basis of all 861 cases, 86% of these models 
were more accurate predictors of the actual criterion diagnoses than the 
clinicians from whom the models were derived. There was no instance of a 
man being greatly superior to his model. When a model was constructed on 
only one-seventh of the cases and used to predict the remaining cases, it 
was still superior to its human counterpart 79% of the time. While the average 
incremental validity of model over man was not large, the consistent 
superiority of the model suggested considerable promise for the bootstrapping 
approach. 

Another recent demonstration of bootstrapping comes from a study of
a graduate-student admissions committee by Dawes (1970). Dawes built a
regression equation to model the average judgment of the four-man committee.
The predictors in the equation were overall undergraduate grade point average,
quality of the undergraduate school, and a score from the Graduate Record
Examination. To evaluate the validity of the model and the possibility of
bootstrapping, Dawes used it to predict the average committee rating for each
of a new sample of 384 applicants. The r value for predicting the new
committee ratings was .78. Most important, however, was the finding that
it was possible to find a cut point on the distribution of predicted scores
such that no one who scored below it was invited by the admissions committee.
Fifty-five percent of the applicants scored below this point, and thus could
have been eliminated by a preliminary screening without doing any injustice
to the committee's actual judgments. Furthermore, the weights used to
predict the committee's behavior were better than the committee itself in
predicting later faculty ratings of the selected students. In an interesting
cost-benefit analysis, Dawes estimated that the use of such a linear model
to screen applicants could result in an annual savings of about 18 million
dollars worth of professional time across the nation's graduate schools.

Concluding Remarks

Some Generalizations about the State of our Knowledge 

What have we learned about human judgment as a result of the
efforts detailed on the preceding pages? Several generalizations seem
appropriate. First, it is evident that the judge responds in a highly predictable
manner to the information available to him. Furthermore,
much of what we used to call intuition can be explicated in a precise and
quantitative manner. With regard to this point, it appears that one's insight
into his own cognitive processes is deficient and there is much to be gained
by appropriate feedback of the quantitative aspects of one's judgment behavior.
Second, we find that judges have a very difficult time weighting and
combining information -- be it probabilistic or deterministic in nature.
To reduce cognitive strain, they resort to simplified decision strategies,
many of which lead them to ignore or misuse relevant information.

The order in which information is received affects its use and integration.
The specific form of sequential effects that occur is very much
dependent upon particular circumstances of the decision task. Similarly,
the manner in which information is displayed and the nature of the required 
response greatly influence the use of that information. In other words, the 
structure of the judgment situation is an important determinant of information 
use. 

Finally, despite the great deal of research already completed, it is
obvious that we know very little about many aspects of information use in
judgment. Few variables have been explored in much depth -- even such
fundamental ones as the number of cues, cue redundancy, or the effects of
various kinds of stress. And the enormous task of interfacing this area
with the mainstream of cognitive psychology -- work on concept formation,
memory, learning, attention, etc. -- remains to be undertaken.

Does the Paradigm Dictate the Research?

One of the objectives of this chapter was to determine whether the 
specific models and methods characteristic of each research paradigm tend 
to focus the researcher’s attention on certain problem areas while causing 
him to neglect others. Such focusing has obviously occurred. For example, 
the Bayesians have been least concerned with developing descriptive models, 
concentrating instead on comparing subjects’ performance with that of an 
optimal model, Bayes' theorem. They have paid little attention to the learning
of optimality, however. Researchers within the correlational paradigm have 
available their own optimal model in the multiple regression equation but 
have shown little interest in comparing subjects with it (except for a sub- 
stantial number of learning studies). Instead, they have spent a great 
deal of effort using correlational methods to describe a judge’s idiosyncratic 
weighting process -- an enterprise in which Bayesians and functional measure- 
ment researchers have been uninterested. Researchers using functional 
measurement to study impression formation have concentrated on distinguishing 
various additive and averaging models and delineating sequential effects at 
the group level. 

These different emphases are further illustrated by the fact that 
experimental manipulations which are similar from one paradigm to the other 
have been undertaken for quite different purposes. For example, the Bayesians
have studied sequence length to gauge its effects on conservatism; set size
was studied in impression formation in order to distinguish adding and
averaging models; and the number of cues was varied by correlational re-
searchers to study the effects upon consistency and complexity of subjects’
strategies.

Can these differences in focus be attributed to the influence of the
model used? Is a researcher inevitably steered in a particular direction by 
his chosen model? To some small extent, we can see that this is true. A 
correlationalist would find it difficult to use, as his cues, intelligence 
reports: "General Tsing was seen last Monday lunching with Ambassador Ptui."
Instead, he will feel more comfortable with conceptually continuous cues such
as MMPI scores, or Grade Point Averages. Similarly, at the level of the 
criterion, a Bayesian is most comfortable working with a small number of 
hypotheses, while the correlationalist can work conveniently with many, 
provided they are unidimensionally scaled.

In general, however, we believe that the major differences in research 
emphasis cannot be traced to differences between the models. On one hand, 
we see neglected problems for which a model is perfectly well suited. Why 
have the Bayesians neglected learning? They have a numerical response, which 
can easily be compared to a numerical optimal response, for every trial; they 
need not partition the data into blocks (as correlationalists must in order 
to compute a beta weight). On the other hand, we see persistent, even stubborn
pursuit of topics for which the model is awkward. Correlationalists have 
been devoted to the search for configural cue utilization, yet the linear 
model is extraordinarily powerful in suppressing such relationships, and 
interactions in ANOVA must be viewed with suspicion because the technique 
lacks invariance properties under believable data transformations. 
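
The point about the linear model can be illustrated with a small numerical
sketch. It is not drawn from any of the studies reviewed: the purely
multiplicative judge, the five-level cue grid, and the function names
fit_two_cue_linear and r_squared are all hypothetical choices made for
illustration.

```python
# Illustrative sketch only: a hypothetical judge who combines two cues
# multiplicatively (a purely configural rule) is fitted with an additive
# linear model. Cue levels and function names are assumptions.

def fit_two_cue_linear(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def r_squared(x1, x2, y, coefs):
    """Proportion of variance in y reproduced by the additive model."""
    b0, b1, b2 = coefs
    my = sum(y) / len(y)
    ss_tot = sum((v - my) ** 2 for v in y)
    ss_res = sum((v - (b0 + b1 * a + b2 * b)) ** 2
                 for a, b, v in zip(x1, x2, y))
    return 1.0 - ss_res / ss_tot


# Full factorial design: every combination of two cues at levels 1..5.
levels = [1, 2, 3, 4, 5]
x1 = [a for a in levels for b in levels]
x2 = [b for a in levels for b in levels]
y = [a * b for a, b in zip(x1, x2)]  # judgments from the configural rule

coefs = fit_two_cue_linear(x1, x2, y)
fit = r_squared(x1, x2, y, coefs)
print(coefs, round(fit, 3))
```

In this constructed case the additive model reproduces 90 percent of the
variance in judgments that were generated by an entirely configural rule,
which is why a high linear fit is weak evidence against configurality.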

Under the intellectual leadership of researchers such as Brunswik,
Edwards, Anderson, Hammond, and Hoffman, several excellent research paradigms
have been wound up around common points of interest, and are chugging rapidly 
down diverging roads. Since any study almost always raises additional ques- 
tions for investigation, there has been no dearth of interesting problems to 
fuel these research vehicles. Unfortunately, these vehicles lack side
windows, and few investigators are looking far enough to the left or right.
Of several hundred studies, only a handful indicate any awareness of the
existence of comparable research under another paradigm. The fact remains, 
however, that all these investigators are interested in the same general 
problem — that of understanding how humans integrate fallible information 
to produce a judgment or decision — and it may be that they are missing some 
important research opportunities by limiting their approach to a single 
paradigm. 

Towards an Integration of Research Efforts

We suggest that researchers should employ a multiparadigm approach, 
searching for the most appropriate tasks, models, and analysis techniques to 
attack the substantive problems of interest to them. We will try to show, 
for several such problems, how such a broader perspective might be
advantageous.

Sequential effects. The dangers of staying within a single model, and 
the potential value of diversity, are illustrated by research on primacy and 
recency effects. Hendricks and Constantini (1970a) found no effect of 
varying information inconsistency in an impression formation task where 
adjectives served as cues. They argued that attention decrement, not
inconsistency, accounts for the primacy commonly found in studies of impression 
formation. However, they were apparently unaware of a number of Bayesian 
studies that did obtain primacy when early and late data were highly incon-
sistent (Dale, 1968; Peterson & DuCharme, 1967; Roby, 1967) and recency when
later data were not so inconsistent (Pitz & Reinhold, 1968; Shanteau, 1969).
The discrepancy between the Hendricks and Constantini data and the Bayesian
results needs to be explained.

The study by Shanteau (1969) provides a nice example of the utility of
applying methods and tasks from different paradigms. Shanteau used a
functional measurement technique to study sequential effects in a Bayesian 
task. He presented subjects with sequences of data constructed according 
to factorial combinations of binary events. Their task was to estimate P(H/D) 
after each datum was received. Sequential effects appear as main effects of 
serial position in such a design. Two experiments clearly showed that recency 
was operating throughout all stages of sequences as long as 15 items. 
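
As a point of reference for such sequential effects, the sketch below
revises P(H/D) after each datum by Bayes' theorem and shows that, for
conditionally independent data, the optimal final estimate does not depend
on the order of the data. It is a minimal illustration, not Shanteau's
procedure: the two hypotheses, the 70/30 likelihoods, and the function names
revise and posterior_sequence are assumptions made for the example.

```python
# Illustrative sketch only: sequential revision of P(H1/D) for two hypotheses
# and binary data. The 70/30 likelihoods and all names are hypothetical.

def revise(prior, datum, p_d_given_h1, p_d_given_h2):
    """One application of Bayes' theorem; returns the revised P(H1/D)."""
    like1 = p_d_given_h1[datum]
    like2 = p_d_given_h2[datum]
    return prior * like1 / (prior * like1 + (1.0 - prior) * like2)


def posterior_sequence(data, p_d_given_h1, p_d_given_h2, prior=0.5):
    """Optimal P(H1/D) after each datum, assuming conditional independence."""
    estimates = []
    p = prior
    for datum in data:
        p = revise(p, datum, p_d_given_h1, p_d_given_h2)
        estimates.append(p)
    return estimates


# P(D/H) for each hypothesis (assumed values).
h1 = {"red": 0.7, "blue": 0.3}
h2 = {"red": 0.3, "blue": 0.7}

# The same three data in two different orders.
seq_a = posterior_sequence(["red", "red", "blue"], h1, h2)
seq_b = posterior_sequence(["blue", "red", "red"], h1, h2)
print(seq_a)
print(seq_b)
```

Both orders end at the same value, so any systematic dependence of a
subject's final estimate on serial position (primacy or recency) is a
departure from the optimal model rather than a property of it.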

Novelty. How do subjects handle data that are rare or novel? Wyer (1970)
examined the effects of novelty, defined in terms of the unconditional 
probability of an adjective, upon impression formation. Novel adjectives 
were seen to carry greater weight, making impressions more polarized. This 
increased weight attached to rare data appears to be in contradiction with 
findings from Bayesian research on rare events (Beach, 1968; Vlek, 1965;
Vlek & van der Heijden, 1967). These studies have presented evidence that
rare events are viewed as uninformative, i.e., they are not given enough 
weight in the decision process. 

Learning. Hammond and his colleagues (e.g., Hammond & Brehmer, in press;
Todd & Hammond, 1965) have long contended that specific feedback derived from
the lens model (i.e., feedback about the weight the subject gives to each cue, 
and the weight the environment gives to each cue) is more effective than non- 
specific feedback (i.e., the "correct" answer). How does this result relate 
to the finding by Martin and Gettys (1969) that probabilistic feedback is 
better than nominal feedback, or to the evidence from Wheeler and Beach (1968) 
and Peterson, DuCharme, and Edwards (1968) that subjects give more optimal 
P(H/D) estimates after they have received training in P(D/H)? If specific 
feedback enhances performance, why then did Pitz and Downing (1967) find that 
subjects’ binary predictions were not improved by detailed information about
the sampling distributions?

Diagnosticity. Both the Bayesian and the correlational models have
well-defined measures of the diagnosticity of data -- P(D/H) and the
cue-criterion correlation r_e, respectively. A unified approach to this
topic seems natural. In the past, correlationalists have done little
exploration in performance (non-learning) studies where diagnosticity was
varied. Bayesian research on this topic has been extensive and has pointed
up the difficulties subjects have in integrating probabilistic information.
It is important to investigate the generality of these difficulties; the
different data and response formats possible within the correlational
paradigm provide an excellent opportunity to do this.

Decision aids. The idea of bootstrapping, which was developed in the
context of regression equations, has some interesting relationships with the 
PIP system designed by Bayesians to improve human judgment. Both are Bayesian 
in spirit, inasmuch as they view human judgments as essential and attempt to 
blend them optimally (see Pankoff & Roberts, 1968, for an elaboration of this
point). However, PIP assumes that the aggregation process is faulty and 
attempts to circumvent this by having subjects estimate P(D/H) values and 
letting a machine combine them. Bootstrapping assumes that subjects can 
aggregate information appropriately except for unreliability that must be 
filtered out. Actually, one could incorporate the bootstrapping notion 
into a PIP system by having subjects make a series of posterior probability
judgments from which their implicit P(D/H) opinions could be inferred. These 
inferred values could then be processed by Bayes’ theorem. Alternatively,
one could apply the PIP assumption to bootstrapping in a correlational 
framework by asking subjects to estimate the regression coefficients directly. 
The success of bootstrapping and PIP systems suggests that the assumptions 
of both are probably correct: judges are biased and unreliable in their
weighting of information. Perhaps a system can be designed to minimize both
these sources of error or, at least, to differentiate situations where PIP
might excel bootstrapping or vice versa.
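
One way the first of these hybrids might look is sketched below. It is a
minimal illustration under stated assumptions, not a description of any
existing PIP system: we assume two hypotheses, take the judge's implicit
likelihood ratio for a datum to be the ratio of his stated posterior odds
after and before that datum, and average the log ratios over repeated
occurrences of the same datum as a simple way of filtering unreliability.
The function names and the sample judgments are hypothetical.

```python
# Illustrative sketch only: infer a judge's implicit likelihood ratios from
# his posterior probability judgments, then let Bayes' theorem aggregate them.
# All names, the averaging rule, and the sample judgments are assumptions.
import math
from collections import defaultdict


def odds(p):
    """Convert a probability (strictly between 0 and 1) to odds."""
    return p / (1.0 - p)


def implied_log_lrs(data, posteriors, prior=0.5):
    """Mean implied log likelihood ratio for each type of datum.

    data: labels of the data in the order presented.
    posteriors: the judge's stated P(H1/D) after each datum.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    previous = odds(prior)
    for label, p in zip(data, posteriors):
        current = odds(p)
        # Ratio of successive odds is the likelihood ratio the judge acted on.
        sums[label] += math.log(current / previous)
        counts[label] += 1
        previous = current
    return {label: sums[label] / counts[label] for label in sums}


def machine_posterior(data, log_lrs, prior=0.5):
    """Aggregate the inferred likelihood ratios by Bayes' theorem."""
    log_odds = math.log(odds(prior)) + sum(log_lrs[d] for d in data)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical judgments from a somewhat inconsistent judge.
observed = ["red", "red", "blue", "red"]
judged = [0.62, 0.75, 0.64, 0.78]

inferred = implied_log_lrs(observed, judged)
print(inferred)
print(machine_posterior(["red", "blue", "red", "red", "red"], inferred))
```

The averaging step plays the role that the regression equation plays in
bootstrapping: it retains the judge's opinion about each datum while
discarding trial-to-trial inconsistency in how that opinion is applied.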

Conclusions 

It is obvious that large gaps exist in our understanding of information 
processing in human judgment — despite several energetic research programs 
over the last decade. We hope that, in the future, researchers will not be
bound unnecessarily by the constraints of one particular experimental para-
digm and will, instead, approach substantive problems with an awareness of
the diverse array of models, methods, and tasks that are available.